Updating the database¶
If you look back at the verifying step of the database, you may notice that the first item we have indexed has the word ‘compass’ spelled incorrectly, which means that we will need to either update just that document, or to re-index the entire database.
Reindexing the database can be done immediately using the index1.py script
we used for the initial indexing; this is because we are using an external
ID for each document we add to the database, taken from the id_NUMBER
field from the original data set. We then pass this to the
method, which updates if there’s already a document under that external ID,
or adds a document to the database otherwise.
In fact, because of this, index1.py can update just part of the database. Just give it a file with only the rows that correspond to documents that need updating. Everything else in the database will be left untouched.
It is also possible to delete documents from the index using the
xapian.Database.delete_document() method on a
xapian.WritableDatabase object. This can be done either by Xapian docid
or using unique ID terms, as with
def delete_docs(dbpath, identifiers): # Open the database we're going to be deleting from. db = xapian.WritableDatabase(dbpath, xapian.DB_OPEN) for identifier in identifiers: idterm = u'Q' + identifier db.delete_document(idterm)
A copy of this code is available in code/python/delete1.py.
Then we just run our deletion tool, giving it identifiers taken from the id_NUMBER field in the data set:
$ python2 code/python/delete1.py db 1953-448 1985-438
After that, we expect to see two fewer documents in our database using xapian-delve:
$ xapian-delve db UUID = 1820ef0a-055b-4946-ae73-67aa4ef5c226 number of documents = 98 average document length = 100.041 document length lower bound = 33 document length upper bound = 251 highest document id ever used = 100 has positional information = true