Term Tokenisation Rules

The rules for turning a piece of text into terms when indexing and for turning a query string into terms when searching are very similar.

Abbreviations/Acronyms/Initialisms

A sequence of two or more Unicode upper case letters, each separated by ., is treated specially to better handle abbreviations, acronyms and initialisms. For example, P.T.O is handled the same as PTO. A trailing . is allowed but not required.
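The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name collapse_acronym is hypothetical, the input is taken to be a single already-isolated token, and Python's str.isupper() stands in for "Unicode upper case letter".

```python
def collapse_acronym(token: str) -> str:
    """Drop the separating dots from tokens such as "P.T.O" or "P.T.O."."""
    # A single trailing "." is allowed but not required.
    body = token[:-1] if token.endswith(".") else token
    letters = body.split(".")
    # Require two or more pieces, each exactly one upper case letter.
    if len(letters) >= 2 and all(len(c) == 1 and c.isupper() for c in letters):
        return "".join(letters)
    # Anything else is left untouched by this sketch.
    return token


print(collapse_acronym("P.T.O"))   # PTO
print(collapse_acronym("P.T.O."))  # PTO
print(collapse_acronym("e.g."))    # e.g. (lower case, so unchanged)
```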

Infix Characters

Some punctuation characters may be present within a term (but not at the start/end). Each such infix character must have a “word character” before and after it. Most infix characters are included in the term, but a few are skipped:

Codepoint  Description
U+00AD     SOFT HYPHEN
U+200B     ZERO WIDTH SPACE
U+200C     ZERO WIDTH NON-JOINER
U+200D     ZERO WIDTH JOINER
U+2060     WORD JOINER
U+FEFF     ZERO WIDTH NO-BREAK SPACE

Notes: Since 2.0.0; < 2.0.0 only
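As a rough illustration of "skipped" infix characters, the sketch below removes the six listed codepoints when they sit between two word characters. It makes several assumptions: the helper names are hypothetical, "word character" is approximated by str.isalnum(), and all six codepoints are treated alike regardless of the version notes.

```python
# Infix characters from the list above that are dropped rather than kept.
SKIPPED_INFIX = frozenset(
    map(chr, (0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF))
)


def is_word_char(ch: str) -> bool:
    # Assumption: "word character" is approximated by str.isalnum().
    return ch.isalnum()


def drop_skipped_infix(word: str) -> str:
    kept = []
    for i, ch in enumerate(word):
        # An infix character needs a word character before and after it.
        is_infix = (
            0 < i < len(word) - 1
            and is_word_char(word[i - 1])
            and is_word_char(word[i + 1])
        )
        if ch in SKIPPED_INFIX and is_infix:
            continue  # skipped: not included in the term
        kept.append(ch)
    return "".join(kept)


print(drop_skipped_infix("co\u00adoperate"))  # cooperate
```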

Infix Characters for Numbers

If both preceded and followed by a Unicode digit then the following infix characters are allowed:

Codepoint  Description                               Char
U+002C     COMMA                                     ,
U+002E     FULL STOP                                 .
U+003B     SEMICOLON                                 ;
U+037E     GREEK QUESTION MARK                       ;
U+0589     ARMENIAN FULL STOP                        ։
U+060D     ARABIC DATE SEPARATOR                     ؍
U+07F8     NKO COMMA                                 ߸
U+2044     FRACTION SLASH                            ⁄
U+FE10     PRESENTATION FORM FOR VERTICAL COMMA      ︐
U+FE13     PRESENTATION FORM FOR VERTICAL COLON      ︓
U+FE14     PRESENTATION FORM FOR VERTICAL SEMICOLON  ︔
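The numeric infix check can be sketched as a small predicate. The function name is_numeric_infix is hypothetical, and str.isdecimal() is used here as an approximation of "Unicode digit"; the real implementation may classify digits differently.

```python
# Codepoints allowed as infix characters between two digits.
NUMERIC_INFIX = frozenset(map(chr, (
    0x002C, 0x002E, 0x003B, 0x037E, 0x0589, 0x060D,
    0x07F8, 0x2044, 0xFE10, 0xFE13, 0xFE14,
)))


def is_numeric_infix(text: str, i: int) -> bool:
    """True if text[i] is a listed infix character with a digit on each side."""
    if not 0 < i < len(text) - 1:
        return False  # an infix character cannot start or end the text
    # Assumption: "Unicode digit" is approximated by str.isdecimal().
    return (
        text[i] in NUMERIC_INFIX
        and text[i - 1].isdecimal()
        and text[i + 1].isdecimal()
    )


print(is_numeric_infix("1,000", 1))  # True
print(is_numeric_infix("v.2", 1))    # False: "v" is not a digit
```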

Infix Characters for Words

Otherwise the following infix characters are allowed (this list is based on the Unicode word boundary rules):

Codepoint  Description                            Char  Replacement
U+0026     AMPERSAND                              &
U+0027     APOSTROPHE                             '
U+00B7     MIDDLE DOT                             ·
U+2019     RIGHT SINGLE QUOTATION MARK            ’     '
U+201B     SINGLE HIGH-REVERSED-9 QUOTATION MARK  ‛     '
U+2027     HYPHENATION POINT                      ‧

The various apostrophe characters (', U+2019 and U+201B) are all normalised to ' (U+0027).
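A minimal sketch of word infix handling follows. It assumes that the normalisation target is the ASCII apostrophe U+0027, that "word character" can be approximated by str.isalnum(), and that the input is a single candidate word; the names WORD_INFIX and normalise_word_infix are hypothetical.

```python
# Maps each allowed word infix character to the character stored in the term.
WORD_INFIX = {
    "\u0026": "\u0026",  # AMPERSAND, kept as-is
    "\u0027": "\u0027",  # APOSTROPHE, kept as-is
    "\u00B7": "\u00B7",  # MIDDLE DOT, kept as-is
    "\u2019": "\u0027",  # RIGHT SINGLE QUOTATION MARK -> apostrophe
    "\u201B": "\u0027",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK -> apostrophe
    "\u2027": "\u2027",  # HYPHENATION POINT, kept as-is
}


def normalise_word_infix(word: str) -> str:
    out = []
    for i, ch in enumerate(word):
        # Only characters with a word character on both sides count as infix.
        is_infix = (
            0 < i < len(word) - 1
            and word[i - 1].isalnum()
            and word[i + 1].isalnum()
        )
        out.append(WORD_INFIX[ch] if ch in WORD_INFIX and is_infix else ch)
    return "".join(out)


print(normalise_word_infix("don\u2019t"))  # don't
print(normalise_word_infix("AT&T"))        # AT&T
```

A dictionary keeps the "allowed" test and the replacement in one place, so adding a character with a different replacement needs only one new entry.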

Suffix Characters

Up to 3 suffix characters are allowed and included on the end of the word. The current suffix characters are + and #.
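The suffix rule can be illustrated with a short helper that extends a term's end position over trailing + and # characters. The name extend_with_suffix is hypothetical, and the text above does not say what happens when more than three suffix characters follow a word; this sketch simply stops after three.

```python
SUFFIX_CHARS = "+#"
MAX_SUFFIX = 3


def extend_with_suffix(text: str, end: int) -> int:
    """Return the term's end index once trailing suffix characters are added.

    `end` is the index just past the last word character of the term.
    """
    n = 0
    while (
        n < MAX_SUFFIX
        and end + n < len(text)
        and text[end + n] in SUFFIX_CHARS
    ):
        n += 1
    return end + n


text = "C++ and C#"
print(text[0:extend_with_suffix(text, 1)])  # C++
print(text[8:extend_with_suffix(text, 9)])  # C#
```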

Phrase Generators

A group of terms each separated by one or more of the following punctuation characters is handled as a phrase search: .-/:@

So for example, joe-blogs@example.org becomes a phrase search for the terms joe, blogs, example and org.
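The splitting step behind the example above can be sketched with a regular expression. This only shows how the terms of the phrase are obtained; the function name phrase_terms is hypothetical, and the sketch ignores the interaction with other rules (for instance, . between two digits is a numeric infix character).

```python
import re

PHRASE_GENERATORS = ".-/:@"
# One or more phrase generator characters in a row act as a single separator.
_SEPARATOR = re.compile("[" + re.escape(PHRASE_GENERATORS) + "]+")


def phrase_terms(token: str) -> list[str]:
    """Split a token into the terms that would make up the phrase search."""
    return [part for part in _SEPARATOR.split(token) if part]


print(phrase_terms("joe-blogs@example.org"))
# ['joe', 'blogs', 'example', 'org']
```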