Term Tokenisation Rules
The rules for turning a piece of text into terms when indexing and for turning a query string into terms when searching are very similar.
Abbreviation/Acronyms/Initialisms
A sequence of two or more Unicode upper case letters, each separated by ., is handled specially so that abbreviations, acronyms and initialisms are treated sensibly. For example, P.T.O is handled the same as PTO. A trailing . is allowed but not required.
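The rule above can be illustrated with a minimal Python sketch. It is not the library's implementation; the function name `collapse_initialism` is hypothetical, and Python's `str.isupper()` is assumed to be a reasonable stand-in for "Unicode upper case letter".

```python
def collapse_initialism(text):
    """Return text with its separating dots removed if it looks like an
    initialism such as P.T.O or P.T.O., otherwise return it unchanged.

    Hypothetical helper for illustration only.
    """
    # A single trailing "." is allowed but not required.
    body = text[:-1] if text.endswith(".") else text
    parts = body.split(".")
    # Need two or more single upper case letters, each separated by ".".
    if len(parts) >= 2 and all(len(p) == 1 and p.isupper() for p in parts):
        return "".join(parts)
    return text


print(collapse_initialism("P.T.O"))   # PTO
print(collapse_initialism("P.T.O."))  # PTO
print(collapse_initialism("e.g."))    # e.g. (lower case, so left alone)
```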
Infix Characters
Some punctuation characters may be present within a term (but not at the start/end). Each such infix character must have a “word character” before and after it. Most infix characters are included in the term, but a few are skipped:
| Codepoint | Description | Notes |
|---|---|---|
| U+00AD | SOFT HYPHEN | Since 2.0.0 |
| U+200B | ZERO WIDTH SPACE | |
| U+200C | ZERO WIDTH NON-JOINER | |
| U+200D | ZERO WIDTH JOINER | |
| U+2060 | WORD JOINER | |
| U+FEFF | ZERO WIDTH NO-BREAK SPACE | < 2.0.0 only |
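As a rough illustration of skipped infix characters, the following Python sketch removes the listed codepoints when they sit between two word characters. It is a sketch under assumptions, not the actual tokeniser: `drop_ignored_infix` is a hypothetical name, `str.isalnum()` stands in for "word character", and the version notes in the table are not modelled (all six codepoints are treated alike).

```python
# Codepoints listed in the table above as skipped infix characters.
IGNORED_INFIX = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def drop_ignored_infix(text):
    """Remove skipped infix characters that have a word character on
    both sides; anything at the start or end of text is left alone."""
    out = []
    for i, ch in enumerate(text):
        if (ord(ch) in IGNORED_INFIX
                and 0 < i < len(text) - 1
                and text[i - 1].isalnum()   # assumption: isalnum ~ word character
                and text[i + 1].isalnum()):
            continue  # skipped: not included in the term
        out.append(ch)
    return "".join(out)


# A soft hyphen inside a word disappears from the resulting term.
print(drop_ignored_infix("co\u00adoperate"))  # cooperate
```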
Infix Characters for Numbers
If both preceded and followed by a Unicode digit then the following infix characters are allowed:
| Codepoint | Description | Char |
|---|---|---|
| U+002C | COMMA | , |
| U+002E | FULL STOP | . |
| U+003B | SEMICOLON | ; |
| U+037E | GREEK QUESTION MARK | ; |
| U+0589 | ARMENIAN FULL STOP | ։ |
| U+060D | ARABIC DATE SEPARATOR | ؍ |
| U+07F8 | NKO COMMA | ߸ |
| U+2044 | FRACTION SLASH | ⁄ |
| U+FE10 | PRESENTATION FORM FOR VERTICAL COMMA | ︐ |
| U+FE13 | PRESENTATION FORM FOR VERTICAL COLON | ︓ |
| U+FE14 | PRESENTATION FORM FOR VERTICAL SEMICOLON | ︔ |
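The numeric infix rule can be sketched as a simple predicate. This is illustrative only: `is_numeric_infix` is a hypothetical name, and Python's `str.isdecimal()` is assumed to approximate "Unicode digit", which may not match the library's exact definition.

```python
# The codepoints from the table above, written as escapes to avoid
# ambiguity between look-alike characters such as U+003B and U+037E.
NUMERIC_INFIX = frozenset(
    "\u002C\u002E\u003B\u037E\u0589\u060D\u07F8\u2044\uFE10\uFE13\uFE14"
)


def is_numeric_infix(text, i):
    """Return True if text[i] is an allowed infix character that is both
    preceded and followed by a digit."""
    return (
        0 < i < len(text) - 1
        and text[i] in NUMERIC_INFIX
        and text[i - 1].isdecimal()  # assumption: isdecimal ~ Unicode digit
        and text[i + 1].isdecimal()
    )


print(is_numeric_infix("1,234", 1))  # True: comma between two digits
print(is_numeric_infix("a,1", 1))    # False: not preceded by a digit
```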
Infix Characters for Words
Otherwise the following infix characters are allowed (this list is based on the Unicode word boundary rules):
| Codepoint | Description | Char | Replacement |
|---|---|---|---|
| U+0026 | AMPERSAND | & | |
| U+0027 | APOSTROPHE | ' | |
| U+00B7 | MIDDLE DOT | · | |
| U+2019 | RIGHT SINGLE QUOTATION MARK | ’ | ' |
| U+201B | SINGLE HIGH-REVERSED-9 QUOTATION MARK | ‛ | ' |
| U+2027 | HYPHENATION POINT | ‧ | |
The Unicode apostrophe-like characters U+2019 and U+201B are normalised to the ASCII apostrophe ' (U+0027), so all three apostrophe forms produce the same term.
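A small Python sketch of the word infix rule, including the apostrophe normalisation, might look like this. The names `WORD_INFIX` and `word_infix` are hypothetical, and `str.isalnum()` is an assumed approximation of "word character"; the real tokeniser may differ in detail.

```python
# Maps each allowed infix character to the character stored in the term.
# U+2019 and U+201B are replaced by the ASCII apostrophe (U+0027).
WORD_INFIX = {
    "\u0026": "\u0026",  # AMPERSAND
    "\u0027": "\u0027",  # APOSTROPHE
    "\u00B7": "\u00B7",  # MIDDLE DOT
    "\u2019": "\u0027",  # RIGHT SINGLE QUOTATION MARK -> '
    "\u201B": "\u0027",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK -> '
    "\u2027": "\u2027",  # HYPHENATION POINT
}


def word_infix(text, i):
    """Return the character to include in the term for the infix at
    text[i], or None if text[i] is not a valid word infix there."""
    if not (0 < i < len(text) - 1):
        return None  # infix characters cannot start or end a term
    if not (text[i - 1].isalnum() and text[i + 1].isalnum()):
        return None  # assumption: isalnum ~ word character
    return WORD_INFIX.get(text[i])


print(word_infix("don\u2019t", 3))  # ' (curly apostrophe normalised)
print(word_infix("AT&T", 2))        # &
```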
Suffix Characters
Up to 3 suffix characters are allowed and included on the end of the word. The current suffix characters are + and #.
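The following Python sketch shows one way to read the suffix rule. It is an assumption-laden illustration: `attach_suffix` is a hypothetical helper, and the text above does not say what happens when more than 3 suffix characters follow a word, so this sketch simply assumes no suffix is attached in that case.

```python
SUFFIX_CHARS = "+#"
MAX_SUFFIX = 3


def attach_suffix(word, rest):
    """Return word with the run of suffix characters at the start of
    rest appended, provided the run is at most MAX_SUFFIX long."""
    n = 0
    while n < len(rest) and rest[n] in SUFFIX_CHARS:
        n += 1
    if n > MAX_SUFFIX:
        # Assumption (not stated in the text): an over-long run is
        # not treated as a suffix at all.
        return word
    return word + rest[:n]


print(attach_suffix("C", "++ code"))  # C++
print(attach_suffix("C", "# code"))   # C#
```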
Phrase Generators
A group of terms each separated by one or more of the following punctuation characters is handled as a phrase search: .-/:@
So for example, joe-blogs@example.org becomes a phrase search for the terms joe, blogs, example and org.
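The example above can be reproduced with a short Python sketch that splits on runs of the listed phrase generator characters. This only shows how the terms of the phrase are obtained; `phrase_terms` is a hypothetical name and the sketch does not build an actual phrase query.

```python
import re

# The phrase generator characters listed above.
PHRASE_GENERATORS = ".-/:@"

# One or more phrase generators in a row act as a single separator.
_SEPARATOR = re.compile("[" + re.escape(PHRASE_GENERATORS) + "]+")


def phrase_terms(text):
    """Split text into the terms that would make up the phrase."""
    return [term for term in _SEPARATOR.split(text) if term]


print(phrase_terms("joe-blogs@example.org"))
# ['joe', 'blogs', 'example', 'org']
```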