writing module

Writer

class whoosh.writing.IndexWriter[source]

High-level object for writing to an index.

To get a writer for a particular index, call writer() on the Index object.

>>> writer = myindex.writer()

You can use this object as a context manager. If an exception is raised inside the context, it calls cancel() to clean up temporary files; otherwise it calls commit() when the context exits.

>>> with myindex.writer() as w:
...     w.add_document(title="First document", content="Hello there.")
...     w.add_document(title="Second document", content="This is easy!")

abstract add_document(**fields)[source]

The keyword arguments map field names to the values to index/store:

w = myindex.writer()
w.add_document(path=u"/a", title=u"First doc", text=u"Hello")
w.commit()

Depending on the field type, some fields may take objects other than unicode strings. For example, NUMERIC fields take numbers, and DATETIME fields take datetime.datetime objects:

from datetime import datetime
from whoosh import index
from whoosh.fields import Schema, DATETIME, NUMERIC, TEXT

schema = Schema(date=DATETIME, size=NUMERIC(float), content=TEXT)
myindex = index.create_in("indexdir", schema)

w = myindex.writer()
w.add_document(date=datetime.now(), size=5.5, content=u"Hello")
w.commit()

Instead of a single object (i.e., unicode string, number, or datetime), you can supply a list or tuple of objects. For unicode strings, this bypasses the field’s analyzer. For numbers and dates, this lets you add multiple values for the given field:

date1 = datetime.now()
date2 = datetime(2005, 12, 25)
date3 = datetime(1999, 1, 1)
w.add_document(date=[date1, date2, date3], size=[9.5, 10],
               content=[u"alfa", u"bravo", u"charlie"])

For fields that are both indexed and stored, you can specify an alternate value to store using a keyword argument in the form “_stored_<fieldname>”. For example, if you have a field named “title” and you want to index the text “a b c” but store the text “e f g”, use keyword arguments like this:

writer.add_document(title=u"a b c", _stored_title=u"e f g")

You can boost the weight of all terms in a certain field by specifying a _<fieldname>_boost keyword argument. For example, if you have a field named “content”, you can double the weight of this document for searches in the “content” field like this:

writer.add_document(content="a b c", _content_boost=2.0)

You can boost every field at once using the _boost keyword. For example, to boost fields “a” and “b” by 2.0, and field “c” by 3.0:

writer.add_document(a="alfa", b="bravo", c="charlie",
                    _boost=2.0, _c_boost=3.0)
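
The keyword conventions above can be summarized with a small stand-alone sketch. It is only an illustration of the naming rules, not Whoosh's own code: the function name split_document_kwargs is hypothetical, and it assumes (as the last example implies) that a per-field boost replaces the document-wide _boost for that field rather than multiplying it.

```python
def split_document_kwargs(**fields):
    """Toy parser for the add_document keyword conventions.

    Returns (indexed, stored, boosts). Illustration only; this is an
    assumption-based sketch, not Whoosh's implementation.
    """
    doc_boost = float(fields.get("_boost", 1.0))
    # Regular fields are the keywords that do not start with an underscore.
    indexed = {k: v for k, v in fields.items() if not k.startswith("_")}
    stored = {}
    boosts = {}
    for name, value in indexed.items():
        # _stored_<fieldname> changes only the stored value.
        stored[name] = fields.get("_stored_" + name, value)
        # _<fieldname>_boost takes precedence over the document-wide _boost.
        boosts[name] = float(fields.get("_" + name + "_boost", doc_boost))
    return indexed, stored, boosts


indexed, stored, boosts = split_document_kwargs(
    a="alfa", b="bravo", c="charlie",
    _stored_a="ALFA", _boost=2.0, _c_boost=3.0)
print(boosts)
print(stored)
```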

Note that some scoring algorithms, including Whoosh’s default BM25F, do not work with term weights less than 1, so you should generally not use a boost factor less than 1.

See also Writer.update_document().

add_field(fieldname, fieldtype, **kwargs)[source]

Adds a field to the index’s schema.

cancel()[source]

Cancels any documents/deletions added by this object and unlocks the index.

commit()[source]

Finishes writing and unlocks the index.
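
The cancel() and commit() methods are what the context-manager form of the writer calls for you. The following is a minimal sketch of that behavior using a hypothetical ToyWriter class backed by a plain list; it models the documented semantics (cancel on exception, commit otherwise) and is not Whoosh's implementation.

```python
class ToyWriter:
    """Stand-in for a writer; hypothetical, not Whoosh's implementation."""

    def __init__(self, committed):
        self.committed = committed   # list standing in for the index
        self.pending = []
        self.locked = True           # pretend we hold the write lock

    def add_document(self, **fields):
        self.pending.append(fields)

    def commit(self):
        # Make pending documents visible and release the "lock".
        self.committed.extend(self.pending)
        self.pending = []
        self.locked = False

    def cancel(self):
        # Throw away pending documents and release the "lock".
        self.pending = []
        self.locked = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Cancel on error, commit otherwise; the exception still propagates.
        if exc_type is not None:
            self.cancel()
        else:
            self.commit()
        return False


docs = []
with ToyWriter(docs) as w:
    w.add_document(title="First document")

try:
    with ToyWriter(docs) as w:
        w.add_document(title="Never committed")
        raise ValueError("simulated failure")
except ValueError:
    pass

print(docs)
```

Only the document from the first block ends up in docs; the second block's document is discarded by cancel().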

delete_by_query(q, searcher=None)[source]

Deletes any documents matching a query object.

Returns:

the number of documents deleted.

delete_by_term(fieldname, text, searcher=None)[source]

Deletes any documents containing the term given by “text” in the “fieldname” field. This is useful when you have an indexed field containing a unique ID (such as “pathname”) for each document.

Returns:

the number of documents deleted.

abstract delete_document(docnum, delete=True)[source]

Deletes a document by number.

end_group()[source]

Finish indexing a group of hierarchical documents. See start_group().

group()[source]

Returns a context manager that calls start_group() and end_group() for you, allowing you to use a with statement to group hierarchical documents:

with myindex.writer() as w:
    with w.group():
        w.add_document(kind="class", name="Accumulator")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="get_result")
        w.add_document(kind="method", name="close")

    with w.group():
        w.add_document(kind="class", name="Calculator")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="multiply")
        w.add_document(kind="method", name="get_result")
        w.add_document(kind="method", name="close")

abstract reader(**kwargs)[source]

Returns a reader for the existing index.

remove_field(fieldname, **kwargs)[source]

Removes the named field from the index’s schema. Depending on the backend implementation, this may or may not actually remove existing data for the field from the index. Optimizing the index should always clear out existing data for a removed field.

start_group()[source]

Start indexing a group of hierarchical documents. The backend should ensure that these documents are all added to the same segment:

with myindex.writer() as w:
    w.start_group()
    w.add_document(kind="class", name="Accumulator")
    w.add_document(kind="method", name="add")
    w.add_document(kind="method", name="get_result")
    w.add_document(kind="method", name="close")
    w.end_group()

    w.start_group()
    w.add_document(kind="class", name="Calculator")
    w.add_document(kind="method", name="add")
    w.add_document(kind="method", name="multiply")
    w.add_document(kind="method", name="get_result")
    w.add_document(kind="method", name="close")
    w.end_group()

A more convenient way to group documents is to use the group() method and the with statement.
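
As a rough illustration of how a group() context manager can wrap the start_group() and end_group() pair, here is a sketch built on contextlib. The ToyGroupingWriter class is hypothetical and only records calls; closing the group in a finally clause is a choice made for this sketch, not a statement about Whoosh's behavior on errors.

```python
from contextlib import contextmanager


class ToyGroupingWriter:
    """Records calls so the start/end pairing is visible; hypothetical."""

    def __init__(self):
        self.calls = []

    def start_group(self):
        self.calls.append("start_group")

    def end_group(self):
        self.calls.append("end_group")

    def add_document(self, **fields):
        self.calls.append(("add", fields["name"]))

    @contextmanager
    def group(self):
        self.start_group()
        try:
            yield self
        finally:
            # Close the group even if adding a document raised.
            self.end_group()


w = ToyGroupingWriter()
with w.group():
    w.add_document(kind="class", name="Accumulator")
    w.add_document(kind="method", name="add")
print(w.calls)
```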

update_document(**fields)[source]

The keyword arguments map field names to the values to index/store.

This method adds a new document to the index, and automatically deletes any documents with the same values in any fields marked “unique” in the schema:

from whoosh import fields, index

schema = fields.Schema(path=fields.ID(unique=True, stored=True),
                       content=fields.TEXT)
myindex = index.create_in("index", schema)

w = myindex.writer()
w.add_document(path=u"/", content=u"Mary had a lamb")
w.commit()

w = myindex.writer()
w.update_document(path=u"/", content=u"Mary had a little lamb")
w.commit()

assert myindex.doc_count() == 1

It is safe to use update_document in place of add_document; if there is no existing document to replace, it simply does an add.

You cannot currently pass a list or tuple of values to a “unique” field.

Because this method has to search for documents with the same unique fields and delete them before adding the new document, it is slower than using add_document.

  • Marking more fields “unique” in the schema will make each update_document call slightly slower.

  • When you are updating multiple documents, it is faster to batch delete all changed documents and then use add_document to add the replacements instead of using update_document.
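
The batching advice in the last point might look like the following sketch. The helper replace_documents is hypothetical; it relies only on the delete_by_term(fieldname, text) and add_document(**fields) methods documented on this page, and the RecordingWriter stand-in exists only so the example runs without an index.

```python
def replace_documents(writer, docs, unique_field):
    """Delete all old versions first, then add the replacements.

    `writer` is any object exposing delete_by_term(fieldname, text) and
    add_document(**fields).
    """
    for doc in docs:
        writer.delete_by_term(unique_field, doc[unique_field])
    for doc in docs:
        writer.add_document(**doc)


class RecordingWriter:
    """Hypothetical stand-in that records calls instead of touching an index."""

    def __init__(self):
        self.calls = []

    def delete_by_term(self, fieldname, text, searcher=None):
        self.calls.append(("delete", fieldname, text))

    def add_document(self, **fields):
        self.calls.append(("add", fields["path"]))


w = RecordingWriter()
replace_documents(
    w,
    [{"path": "/a", "content": "new a"}, {"path": "/b", "content": "new b"}],
    "path",
)
print(w.calls)
```

With a real writer you would call commit() once after the helper returns, so the deletions and additions land in a single commit.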

Note that this method will only replace a committed document; currently it cannot replace documents you’ve added to the IndexWriter but haven’t yet committed. For example, if you do this:

>>> writer.update_document(unique_id=u"1", content=u"Replace me")
>>> writer.update_document(unique_id=u"1", content=u"Replacement")

…this will add two documents with the same value of unique_id, instead of the second document replacing the first.
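
The following toy model sketches why this happens. It assumes, for illustration only, that update_document looks for matching unique values among committed documents and never among the writer's own pending additions; ToyIndex and ToyWriter are hypothetical classes, and the unique field name is hard-coded as unique_id.

```python
class ToyIndex:
    """In-memory model of committed documents; hypothetical, not Whoosh."""

    def __init__(self):
        self.committed = []

    def writer(self):
        return ToyWriter(self)


class ToyWriter:
    def __init__(self, index):
        self.index = index
        self.deleted_ids = set()
        self.pending = []

    def add_document(self, **fields):
        self.pending.append(fields)

    def update_document(self, **fields):
        # Only committed documents are checked for a matching unique value;
        # documents still in self.pending are not considered.
        for doc in self.index.committed:
            if doc["unique_id"] == fields["unique_id"]:
                self.deleted_ids.add(id(doc))
        self.add_document(**fields)

    def commit(self):
        kept = [d for d in self.index.committed
                if id(d) not in self.deleted_ids]
        self.index.committed = kept + self.pending
        self.pending = []
        self.deleted_ids = set()


ix = ToyIndex()
w = ix.writer()
w.update_document(unique_id="1", content="Replace me")
w.update_document(unique_id="1", content="Replacement")
w.commit()
same_writer_count = len(ix.committed)    # both copies survive

w = ix.writer()
w.update_document(unique_id="1", content="Final")
w.commit()
after_second_commit = len(ix.committed)  # committed copies were replaced
print(same_writer_count, after_second_commit)
```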

See Writer.add_document() for information on _stored_<fieldname>, _<fieldname>_boost, and _boost keyword arguments.

Utility writers

class whoosh.writing.BufferedWriter(index, period=60, limit=10, writerargs=None, commitargs=None)[source]

Convenience class that acts like a writer but buffers added documents before dumping the buffered documents as a batch into the actual index.

In scenarios where you are continuously adding single documents very rapidly (for example a web application where lots of users are adding content simultaneously), using a BufferedWriter is much faster than opening and committing a writer for each document you add. If you’re adding batches of documents at a time, you can just use a regular writer.

(This class may also be useful for batches of update_document calls. In a normal writer, update_document calls cannot update documents you’ve added in that writer. With BufferedWriter, this will work.)

To use this class, create it from your index and keep it open, sharing it between threads.

>>> from whoosh.writing import BufferedWriter
>>> writer = BufferedWriter(myindex, period=120, limit=20)
>>> # Then you can use the writer to add and update documents
>>> writer.add_document(...)
>>> writer.add_document(...)
>>> writer.add_document(...)
>>> # Before the writer goes out of scope, call close() on it
>>> writer.close()

Note

This object stores documents in memory and may keep an underlying writer open, so you must explicitly call the close() method on this object before it goes out of scope to release the write lock and make sure any uncommitted changes are saved.
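
One way to make sure close() is always called is to wrap the writer's lifetime in contextlib.closing (or an equivalent try/finally). The sketch below uses a hypothetical ToyBufferedWriter that exposes only add_document() and close(), so it runs without an index; whether a real BufferedWriter should save buffered documents after a failure is an application decision this sketch does not address.

```python
from contextlib import closing


class ToyBufferedWriter:
    """Hypothetical stand-in exposing only add_document() and close()."""

    def __init__(self):
        self.saved = []
        self._buffer = []
        self.closed = False

    def add_document(self, **fields):
        self._buffer.append(fields)

    def close(self):
        # Save uncommitted changes and release resources.
        self.saved.extend(self._buffer)
        self._buffer = []
        self.closed = True


writer = ToyBufferedWriter()
try:
    with closing(writer):
        writer.add_document(title="kept")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
print(writer.closed, writer.saved)
```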

You can read/search the combination of the on-disk index and the buffered documents in memory by calling BufferedWriter.reader() or BufferedWriter.searcher(). This allows quasi-real-time search, where documents are available for searching as soon as they are buffered in memory, before they are committed to disk.

Tip

By using a searcher from the shared writer, multiple threads can search the buffered documents. Of course, other processes will only see the documents that have been written to disk. If you want indexed documents to become available to other processes as soon as possible, you have to use a traditional writer instead of a BufferedWriter.

You can control how often the BufferedWriter flushes the in-memory index to disk using the period and limit arguments. period is the maximum number of seconds between commits. limit is the maximum number of additions to buffer between commits.

You don’t need to call commit() on the BufferedWriter manually. Doing so will just flush the buffered documents to disk early. You can continue to make changes after calling commit(), and you can call commit() multiple times.
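
The interaction of period and limit can be sketched with a toy buffer. This is a simplified model under stated assumptions, not BufferedWriter's implementation: the hypothetical ToyBuffer checks the clock on each add instead of running a background timer, and it flushes to an in-memory list of batches instead of an index.

```python
import time


class ToyBuffer:
    """Flushes when `limit` additions are buffered or `period` seconds pass.

    Hypothetical model: the clock is checked on each add rather than by a
    background timer.
    """

    def __init__(self, period=60, limit=10, clock=time.monotonic):
        self.period = period
        self.limit = limit
        self.clock = clock
        self.buffered = []
        self.flushed_batches = []
        self.last_flush = clock()

    def add_document(self, **fields):
        self.buffered.append(fields)
        too_many = len(self.buffered) >= self.limit
        # A period of 0 or None disables time-based flushing.
        too_old = bool(self.period) and (
            self.clock() - self.last_flush >= self.period)
        if too_many or too_old:
            self.commit()

    def commit(self):
        # Write the batch out early; the buffer stays usable afterwards.
        if self.buffered:
            self.flushed_batches.append(self.buffered)
            self.buffered = []
        self.last_flush = self.clock()

    def close(self):
        # Save whatever is still in memory before the object goes away.
        self.commit()


buf = ToyBuffer(period=0, limit=3)   # period=0: no time-based flushing
for n in range(7):
    buf.add_document(n=n)
buf.close()
print([len(batch) for batch in buf.flushed_batches])
```

Seven additions with limit=3 produce two full batches, and close() flushes the one remaining document.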

Parameters:
  • index – the whoosh.index.Index to write to.

  • period – the maximum amount of time (in seconds) between commits. Set this to 0 or None to not use a timer. Do not set this any lower than a few seconds.

  • limit – the maximum number of documents to buffer before committing.

  • writerargs – dictionary specifying keyword arguments to be passed to the index’s writer() method when creating a writer.

add_document(**fields)[source]

The keyword arguments map field names to the values to index/store:

w = myindex.writer()
w.add_document(path=u"/a", title=u"First doc", text=u"Hello")
w.commit()

Depending on the field type, some fields may take objects other than unicode strings. For example, NUMERIC fields take numbers, and DATETIME fields take datetime.datetime objects:

from datetime import datetime
from whoosh import index
from whoosh.fields import Schema, DATETIME, NUMERIC, TEXT

schema = Schema(date=DATETIME, size=NUMERIC(float), content=TEXT)
myindex = index.create_in("indexdir", schema)

w = myindex.writer()
w.add_document(date=datetime.now(), size=5.5, content=u"Hello")
w.commit()

Instead of a single object (i.e., unicode string, number, or datetime), you can supply a list or tuple of objects. For unicode strings, this bypasses the field’s analyzer. For numbers and dates, this lets you add multiple values for the given field:

date1 = datetime.now()
date2 = datetime(2005, 12, 25)
date3 = datetime(1999, 1, 1)
w.add_document(date=[date1, date2, date3], size=[9.5, 10],
               content=[u"alfa", u"bravo", u"charlie"])

For fields that are both indexed and stored, you can specify an alternate value to store using a keyword argument in the form “_stored_<fieldname>”. For example, if you have a field named “title” and you want to index the text “a b c” but store the text “e f g”, use keyword arguments like this:

writer.add_document(title=u"a b c", _stored_title=u"e f g")

You can boost the weight of all terms in a certain field by specifying a _<fieldname>_boost keyword argument. For example, if you have a field named “content”, you can double the weight of this document for searches in the “content” field like this:

writer.add_document(content="a b c", _content_boost=2.0)

You can boost every field at once using the _boost keyword. For example, to boost fields “a” and “b” by 2.0, and field “c” by 3.0:

writer.add_document(a="alfa", b="bravo", c="charlie",
                    _boost=2.0, _c_boost=3.0)

Note that some scoring algorithms, including Whoosh’s default BM25F, do not work with term weights less than 1, so you should generally not use a boost factor less than 1.

See also Writer.update_document().

commit(restart=True)[source]

Finishes writing and unlocks the index.

delete_document(docnum, delete=True)[source]

Deletes a document by number.

reader(**kwargs)[source]

Returns a reader for the existing index.

update_document(**fields)[source]

The keyword arguments map field names to the values to index/store.

This method adds a new document to the index, and automatically deletes any documents with the same values in any fields marked “unique” in the schema:

from whoosh import fields, index

schema = fields.Schema(path=fields.ID(unique=True, stored=True),
                       content=fields.TEXT)
myindex = index.create_in("index", schema)

w = myindex.writer()
w.add_document(path=u"/", content=u"Mary had a lamb")
w.commit()

w = myindex.writer()
w.update_document(path=u"/", content=u"Mary had a little lamb")
w.commit()

assert myindex.doc_count() == 1

It is safe to use update_document in place of add_document; if there is no existing document to replace, it simply does an add.

You cannot currently pass a list or tuple of values to a “unique” field.

Because this method has to search for documents with the same unique fields and delete them before adding the new document, it is slower than using add_document.

  • Marking more fields “unique” in the schema will make each update_document call slightly slower.

  • When you are updating multiple documents, it is faster to batch delete all changed documents and then use add_document to add the replacements instead of using update_document.

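Note that in a regular IndexWriter this method will only replace a committed document; currently it cannot replace documents you’ve added to the writer but haven’t yet committed. (As the class description notes, BufferedWriter does not have this restriction.) For example, if you do this with a regular writer: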

>>> writer.update_document(unique_id=u"1", content=u"Replace me")
>>> writer.update_document(unique_id=u"1", content=u"Replacement")

…this will add two documents with the same value of unique_id, instead of the second document replacing the first.

See Writer.add_document() for information on _stored_<fieldname>, _<fieldname>_boost, and _boost keyword arguments.

class whoosh.writing.AsyncWriter(index, delay=0.25, writerargs=None)[source]

Convenience wrapper for a writer object that might fail due to locking (i.e. the filedb writer). This object will attempt once to obtain the underlying writer, and if it’s successful, will simply pass method calls on to it.

If this object can’t obtain a writer immediately, it will buffer delete, add, and update method calls in memory until you call commit(). At that point, this object will start running in a separate thread, trying to obtain the writer over and over, and once it obtains it, “replay” all the buffered method calls on it.
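
The try-once, buffer, retry, and replay sequence described above can be modeled with the standard library alone. In this sketch a threading.Lock stands in for the index lock and a plain list stands in for the index; ToyAsyncWriter is a hypothetical class written for illustration and is not AsyncWriter's implementation.

```python
import threading
import time


class ToyAsyncWriter(threading.Thread):
    """Buffers calls while the lock is busy, then replays them in a thread.

    Hypothetical model: a threading.Lock stands in for the index lock and a
    plain list stands in for the index.
    """

    def __init__(self, index_lock, index_docs, delay=0.01):
        super().__init__()
        self.index_lock = index_lock
        self.index_docs = index_docs
        self.delay = delay
        self.events = []
        # Try exactly once to take the lock up front.
        self.has_lock = index_lock.acquire(blocking=False)

    def add_document(self, **fields):
        if self.has_lock:
            self.index_docs.append(fields)   # pass straight through
        else:
            self.events.append(fields)       # remember for later replay

    def commit(self):
        if self.has_lock:
            self.index_lock.release()
            self.has_lock = False
        else:
            self.start()                     # keep retrying in the background

    def run(self):
        # Retry until the lock is free, pausing `delay` seconds between tries.
        while not self.index_lock.acquire(blocking=False):
            time.sleep(self.delay)
        try:
            # Replay the buffered calls in their original order.
            self.index_docs.extend(self.events)
        finally:
            self.index_lock.release()


lock = threading.Lock()
docs = []
lock.acquire()                  # someone else is writing
w = ToyAsyncWriter(lock, docs)
w.add_document(title="queued")
w.commit()                      # returns immediately
lock.release()                  # the other writer finishes
w.join()
print(docs)
```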

In a typical scenario where you’re adding a single or a few documents to the index as the result of a Web transaction, this lets you just create the writer, add, and commit, without having to worry about index locks, retries, etc.

For example, to get an asynchronous writer, instead of this:

>>> writer = myindex.writer()

Do this:

>>> from whoosh.writing import AsyncWriter
>>> writer = AsyncWriter(myindex)

Parameters:
  • index – the whoosh.index.Index to write to.

  • delay – the delay (in seconds) between attempts to instantiate the actual writer.

  • writerargs – an optional dictionary specifying keyword arguments to be passed to the index’s writer() method.

add_document(*args, **kwargs)[source]

The keyword arguments map field names to the values to index/store:

w = myindex.writer()
w.add_document(path=u"/a", title=u"First doc", text=u"Hello")
w.commit()

Depending on the field type, some fields may take objects other than unicode strings. For example, NUMERIC fields take numbers, and DATETIME fields take datetime.datetime objects:

from datetime import datetime
from whoosh import index
from whoosh.fields import Schema, DATETIME, NUMERIC, TEXT

schema = Schema(date=DATETIME, size=NUMERIC(float), content=TEXT)
myindex = index.create_in("indexdir", schema)

w = myindex.writer()
w.add_document(date=datetime.now(), size=5.5, content=u"Hello")
w.commit()

Instead of a single object (i.e., unicode string, number, or datetime), you can supply a list or tuple of objects. For unicode strings, this bypasses the field’s analyzer. For numbers and dates, this lets you add multiple values for the given field:

date1 = datetime.now()
date2 = datetime(2005, 12, 25)
date3 = datetime(1999, 1, 1)
w.add_document(date=[date1, date2, date3], size=[9.5, 10],
               content=[u"alfa", u"bravo", u"charlie"])

For fields that are both indexed and stored, you can specify an alternate value to store using a keyword argument in the form “_stored_<fieldname>”. For example, if you have a field named “title” and you want to index the text “a b c” but store the text “e f g”, use keyword arguments like this:

writer.add_document(title=u"a b c", _stored_title=u"e f g")

You can boost the weight of all terms in a certain field by specifying a _<fieldname>_boost keyword argument. For example, if you have a field named “content”, you can double the weight of this document for searches in the “content” field like this:

writer.add_document(content="a b c", _content_boost=2.0)

You can boost every field at once using the _boost keyword. For example, to boost fields “a” and “b” by 2.0, and field “c” by 3.0:

writer.add_document(a="alfa", b="bravo", c="charlie",
                    _boost=2.0, _c_boost=3.0)

Note that some scoring algorithms, including Whoosh’s default BM25F, do not work with term weights less than 1, so you should generally not use a boost factor less than 1.

See also Writer.update_document().

add_field(*args, **kwargs)[source]

Adds a field to the index’s schema.

cancel(*args, **kwargs)[source]

Cancels any documents/deletions added by this object and unlocks the index.

commit(*args, **kwargs)[source]

Finishes writing and unlocks the index.

delete_by_term(*args, **kwargs)[source]

Deletes any documents containing the term given by “text” in the “fieldname” field. This is useful when you have an indexed field containing a unique ID (such as “pathname”) for each document.

Returns:

the number of documents deleted.

delete_document(*args, **kwargs)[source]

Deletes a document by number.

reader()[source]

Returns a reader for the existing index.

remove_field(*args, **kwargs)[source]

Removes the named field from the index’s schema. Depending on the backend implementation, this may or may not actually remove existing data for the field from the index. Optimizing the index should always clear out existing data for a removed field.

run()[source]

Method representing the thread’s activity. AsyncWriter runs this in a separate thread after commit() when it could not obtain the underlying writer immediately: it retries every delay seconds until the writer is available, then replays the buffered method calls on it. You normally do not call this method directly.

update_document(*args, **kwargs)[source]

The keyword arguments map field names to the values to index/store.

This method adds a new document to the index, and automatically deletes any documents with the same values in any fields marked “unique” in the schema:

from whoosh import fields, index

schema = fields.Schema(path=fields.ID(unique=True, stored=True),
                       content=fields.TEXT)
myindex = index.create_in("index", schema)

w = myindex.writer()
w.add_document(path=u"/", content=u"Mary had a lamb")
w.commit()

w = myindex.writer()
w.update_document(path=u"/", content=u"Mary had a little lamb")
w.commit()

assert myindex.doc_count() == 1

It is safe to use update_document in place of add_document; if there is no existing document to replace, it simply does an add.

You cannot currently pass a list or tuple of values to a “unique” field.

Because this method has to search for documents with the same unique fields and delete them before adding the new document, it is slower than using add_document.

  • Marking more fields “unique” in the schema will make each update_document call slightly slower.

  • When you are updating multiple documents, it is faster to batch delete all changed documents and then use add_document to add the replacements instead of using update_document.

Note that this method will only replace a committed document; currently it cannot replace documents you’ve added to the IndexWriter but haven’t yet committed. For example, if you do this:

>>> writer.update_document(unique_id=u"1", content=u"Replace me")
>>> writer.update_document(unique_id=u"1", content=u"Replacement")

…this will add two documents with the same value of unique_id, instead of the second document replacing the first.

See Writer.add_document() for information on _stored_<fieldname>, _<fieldname>_boost, and _boost keyword arguments.

Exceptions

exception whoosh.writing.IndexingError[source]