Databases
Pretty much all Xapian operations revolve around a Xapian database. Before searches can be performed, details of the documents to be searched need to be put into a database; the search process then refers to the database to efficiently determine the best matches for a given query. The process of putting documents into the database is usually referred to as indexing.
The main information stored in a database is a mapping from each term to a list of all the documents it occurs in, together with various statistics about these occurrences. It may also store the full text of the documents, or extracts from them, so that result summaries can be displayed. Databases can also contain additional data for spelling correction and synonym expansion, and developers can even store arbitrary key-value pairs in part of the database.
Backends
Xapian databases store data in custom formats which allow searches to be performed extremely quickly; Xapian does not use a relational database as its datastore. There are several database backends; the main backend in the 1.4 release series of Xapian is called the Glass backend. This stores information in the filesystem (under a given path).
It is possible to perform searches across multiple databases at once, and Xapian will handle merging the results together appropriately. This feature can be combined with remote databases to handle datasets which are too large for a single machine, by performing searches across multiple remote databases.
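As a minimal sketch of searching several databases as one, assuming the Python bindings: an empty xapian.Database acts as a container, and add_database() attaches each sub-database to it. The helper name open_combined and its xapian_module parameter are hypothetical; the parameter exists only so the sketch can be exercised with a stand-in object when the bindings are not installed.

```python
def open_combined(paths, xapian_module=None):
    """Open every path and group the results into one searchable database.

    `xapian_module` is a hypothetical hook for this sketch; leave it as
    None to use the real bindings.
    """
    if xapian_module is None:
        import xapian as xapian_module  # imported lazily, only when needed

    # An empty Database is a container that sub-databases are added to.
    combined = xapian_module.Database()
    for path in paths:
        combined.add_database(xapian_module.Database(path))
    return combined
```

The returned object can then be searched like any single database; Xapian merges the results from the sub-databases.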
Todo
Document using add_database() to achieve this
Todo
trunk supports writable multi databases
Todo
mapping of docids
On-disk databases
As mentioned, Xapian 1.4 has a default database type called Glass; earlier formats can be upgraded using Xapian’s copydatabase utility. When opening an existing database, Xapian will automatically figure out the backend to use.
If you’re familiar with data storage structures, you might be interested to know that both Glass and its predecessor Chert use a copy-on-write B+-tree structure, but don’t worry if that doesn’t mean anything to you!
Stub database files
Xapian supports a simple text file format for listing the locations of a set of databases (either on the local file system, or remote databases). Such files are called stub-databases, and can be used to point to a database when the physical database location may vary; for example, because a new database is being built nightly, and is named according to the date on which it was built.
These files are recognised by the autodetection in the Database constructor, or you can open them explicitly using Xapian::DB_BACKEND_STUB.
If the path provided to the Database constructor is a directory containing a file called XAPIANDB, that file is considered to be the stub database file.
The stub database format specifies one database per line, prefixed by the type. For example:
remote localhost:23876
auto /var/spool/xapian/webindex
This way you can have pre-canned sets of databases to search.
Using such stub files you can swap databases atomically (with a file rename) in a production environment without having to worry about race conditions. For example, if you want to rebuild the database from scratch and replace it, you can build the database in a new directory, prepare a stub file with the new path, and finally move the stub file over the one which the running code is using.
This technique is better than just replacing the database directory, which is prone to race conditions.
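The swap described above can be sketched with the Python standard library alone. The function name switch_stub and the example paths are hypothetical; the sketch assumes the stub file and its temporary replacement live in the same directory, so that the final rename stays on one filesystem.

```python
import os
import tempfile


def switch_stub(stub_path, new_db_path, db_type="auto"):
    """Atomically repoint the stub file at `stub_path` to `new_db_path`."""
    stub_dir = os.path.dirname(os.path.abspath(stub_path))
    # Write the new stub next to the old one so the rename below does not
    # cross filesystems. Note that mkstemp creates the file readable by its
    # owner only; adjust permissions if readers run as another user.
    fd, tmp_path = tempfile.mkstemp(dir=stub_dir, prefix=".stub-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(f"{db_type} {new_db_path}\n")
        # os.replace renames over the existing file in a single step, so a
        # reader sees either the old stub or the new one, never a mixture.
        os.replace(tmp_path, stub_path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A nightly rebuild might finish by calling switch_stub with the stub file's path and the freshly built database directory; searchers pick up the new database the next time they open the stub.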
Database types
The current types understood by Xapian are:
auto
This isn’t an actual database format, but rather auto-detection of one of the disk based backends (e.g. “chert” or “glass”). It takes a single specified path (which can be to a file or directory) as argument:
auto /var/spool/xapian/webindex
glass
Glass is the default backend in Xapian 1.4.x. It supports incremental modifications, and concurrent single-writer and multiple-reader access to a database. It’s very efficient and highly scalable, and more compact than chert. It takes a path as argument like auto.
chert
Chert was the default backend in Xapian 1.2.x. It supports incremental modifications, and concurrent single-writer and multiple-reader access to a database. It’s very efficient and highly scalable. It takes a path as argument like auto.
inmemory
This type is a database held entirely in memory. It was originally written for testing purposes only, but may prove useful for building up temporary small databases.
remote
This can specify either a “program” or TCP remote backend, for example:
remote :ssh xapian-prog.example.com xapian-progsrv /srv/xapian/db1
or:
remote xapian-tcp.example.com:12345
If the first character of the second word is a colon (:), the “program” variant of the remote backend is used: the colon is skipped and the remainder of the line is used as the command to run xapian-progsrv. Otherwise the TCP variant of the remote backend is used, and the rest of the line specifies the host and port to connect to.
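To make the line format concrete, here is a small sketch that splits a stub line into its type and arguments, following the rules described above. It is an illustration only, not Xapian's own parser; the function name parse_stub_line and the shape of the returned dictionaries are hypothetical.

```python
def parse_stub_line(line):
    """Split one stub-file line into (type, details)."""
    db_type, _, rest = line.strip().partition(" ")
    rest = rest.strip()

    if db_type != "remote":
        # auto, glass and chert take a path; it is simply empty if the
        # line gives none.
        return db_type, {"path": rest}

    if rest.startswith(":"):
        # "program" variant: skip the colon, the remainder is the command.
        return "remote", {"variant": "program", "command": rest[1:]}

    # TCP variant: the remainder is host:port.
    host, _, port = rest.rpartition(":")
    return "remote", {"variant": "tcp", "host": host, "port": int(port)}
```

For the two remote examples above, this yields the program variant with the ssh command line, and the TCP variant with host xapian-tcp.example.com and port 12345.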
Todo
uses e.g. keeping latest changes in a small DB you merge periodically
In-memory databases
Xapian has an inmemory database type, which may be useful for testing and perhaps some short-term usage. However, it is inefficient and does not support all of Xapian’s features (such as spelling correction, synonyms or replication), so for production systems it is often better to use an on-disk database such as Glass, with the files stored in a RAM disk.
Remote databases and replication
Xapian’s remote database backend allows the database to be located on a different machine and accessed via a custom protocol.
There is also special support for replicating databases to multiple machines, such that only the parts of the database which have been modified are copied; this can be useful for redundancy and load-balancing purposes.
Concurrent access
Most backend formats (and certainly the main backend format for each release) will allow updates to be grouped into transactions, and will allow at least some old versions of the database to be searched while new ones are being written.
Currently, all the backends only support a single writer existing at a given time; attempting to open another writer on the same database will throw xapian.DatabaseLockError to indicate that it wasn’t possible to acquire a lock. Multiple concurrent readers are supported (in addition to the writer).
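A minimal sketch of handling the single-writer rule, assuming the Python bindings: the second attempt to open a writer raises xapian.DatabaseLockError, which the caller can catch and treat as "someone else is writing". The helper name try_open_writer and its xapian_module parameter are hypothetical; the parameter exists only so the sketch can be exercised without the bindings installed.

```python
def try_open_writer(path, xapian_module=None):
    """Return a writer for `path`, or None if another writer holds the lock.

    `xapian_module` is a hypothetical hook for this sketch; leave it as
    None to use the real bindings.
    """
    if xapian_module is None:
        import xapian as xapian_module  # imported lazily, only when needed

    try:
        # DB_CREATE_OR_OPEN creates the database if it does not exist yet.
        return xapian_module.WritableDatabase(
            path, xapian_module.DB_CREATE_OR_OPEN
        )
    except xapian_module.DatabaseLockError:
        # Another writer already has this database open for writing.
        return None
```

Whether to give up, wait and retry, or queue the update elsewhere when None comes back is an application decision.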
When a database is opened for reading, a fixed snapshot of the database is referenced by the reader (essentially Multi-Version Concurrency Control). Updates which are made to the database will not be visible to the reader unless it calls xapian.Database.reopen(). If the reader is already reading the latest committed version of the database then reopen() has no effect and is a cheap operation, so if you are reusing the same xapian.Database object for multiple searches then it is a reasonable strategy to call reopen() prior to each search.
Currently Xapian’s disk based backends have a limitation to their multi-version concurrency implementation: at most two versions can exist concurrently. Therefore a reader will be able to access its snapshot of the database without limitations when only one change has been made and committed by the writer, but after the writer has made two changes, readers will receive a xapian.DatabaseModifiedError if they attempt to access a part of the database which has changed. In this situation, the reader can be updated to the latest version using the xapian.Database.reopen() method.
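One way to combine the two pieces of advice above (call reopen() before each search, and reopen again after a xapian.DatabaseModifiedError) is a small retry wrapper. This is a sketch under assumptions, not a prescribed pattern: the name search_with_retry, the run_search callback, the retry limit and the xapian_module parameter are all hypothetical.

```python
def search_with_retry(db, run_search, max_retries=3, xapian_module=None):
    """Run `run_search(db)` on a fresh snapshot, retrying if it goes stale.

    `xapian_module` is a hypothetical hook for this sketch; leave it as
    None to use the real bindings.
    """
    if xapian_module is None:
        import xapian as xapian_module  # imported lazily, only when needed

    for attempt in range(max_retries + 1):
        # Cheap when the reader is already on the latest committed version.
        db.reopen()
        try:
            return run_search(db)
        except xapian_module.DatabaseModifiedError:
            # The writer committed twice while we were reading; the next
            # loop iteration reopens and tries again.
            if attempt == max_retries:
                raise
```

With the real bindings, run_search would typically create a xapian.Enquire for the database, set the query, and return the result of get_mset().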
Locking
With the disk-based Xapian backends, when a database is opened for writing, a lock is obtained on the database to ensure that no further writers are opened concurrently. This lock will be released when the database writer is closed (or automatically if the writer process dies).
One unusual feature of Xapian’s locking mechanism (at least on POSIX operating systems other than Linux) is that Xapian forks a subprocess to hold the lock, rather than holding it in the main process. This is to avoid the lock being accidentally released due to the slightly unhelpful semantics of fcntl locks. Linux kernel 3.15 added OFD (open file description) fcntl locks, which have more helpful semantics; Xapian uses these in preference, avoiding the need to fork a subprocess to hold the lock.