writing module
Writer
- class whoosh.writing.IndexWriter
High-level object for writing to an index.
To get a writer for a particular index, call writer() on the Index object:
    >>> writer = myindex.writer()
You can use this object as a context manager. If an exception is thrown from within the context, it calls cancel() to clean up temporary files; otherwise it calls commit() when the context exits.
    >>> with myindex.writer() as w:
    ...     w.add_document(title="First document", content="Hello there.")
    ...     w.add_document(title="Second document", content="This is easy!")
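The commit-or-cancel rule of the context manager can be pictured with a small stand-in class. This is only a sketch in plain Python: ToyWriter and its calls list are hypothetical names invented for illustration, not part of Whoosh, and the sketch assumes the exception is re-raised after cancel() has run.

```python
class ToyWriter:
    """Hypothetical stand-in that mimics only the commit-or-cancel rule."""

    def __init__(self):
        self.calls = []  # records which cleanup method ran

    def commit(self):
        self.calls.append("commit")

    def cancel(self):
        self.calls.append("cancel")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # An exception inside the block cancels; a clean exit commits.
        if exc_type is not None:
            self.cancel()
        else:
            self.commit()
        return False  # assumption: do not swallow the exception


ok = ToyWriter()
with ok:
    pass  # no error, so commit() runs on exit

failed = ToyWriter()
try:
    with failed:
        raise ValueError("indexing went wrong")
except ValueError:
    pass  # the error still propagates after cancel() has run
```

After this runs, ok.calls holds only "commit" and failed.calls holds only "cancel".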
- abstract add_document(**fields)
The keyword arguments map field names to the values to index/store:
    w = myindex.writer()
    w.add_document(path=u"/a", title=u"First doc", text=u"Hello")
    w.commit()
Depending on the field type, some fields may take objects other than unicode strings. For example, NUMERIC fields take numbers, and DATETIME fields take datetime.datetime objects:
    from datetime import datetime
    from whoosh import index
    from whoosh.fields import Schema, DATETIME, NUMERIC, TEXT

    schema = Schema(date=DATETIME, size=NUMERIC(float), content=TEXT)
    myindex = index.create_in("indexdir", schema)

    w = myindex.writer()
    w.add_document(date=datetime.now(), size=5.5, content=u"Hello")
    w.commit()
Instead of a single object (i.e., unicode string, number, or datetime), you can supply a list or tuple of objects. For unicode strings, this bypasses the field’s analyzer. For numbers and dates, this lets you add multiple values for the given field:
    date1 = datetime.now()
    date2 = datetime(2005, 12, 25)
    date3 = datetime(1999, 1, 1)
    w.add_document(date=[date1, date2, date3], size=[9.5, 10],
                   content=[u"alfa", u"bravo", u"charlie"])
For fields that are both indexed and stored, you can specify an alternate value to store using a keyword argument in the form “_stored_<fieldname>”. For example, if you have a field named “title” and you want to index the text “a b c” but store the text “e f g”, use keyword arguments like this:
    writer.add_document(title=u"a b c", _stored_title=u"e f g")
You can boost the weight of all terms in a certain field by specifying a _<fieldname>_boost keyword argument. For example, if you have a field named “content”, you can double the weight of this document for searches in the “content” field like this:
    writer.add_document(content="a b c", _content_boost=2.0)
You can boost every field at once using the _boost keyword. For example, to boost fields “a” and “b” by 2.0, and field “c” by 3.0:
    writer.add_document(a="alfa", b="bravo", c="charlie", _boost=2.0, _c_boost=3.0)
Note that some scoring algorithms, including Whoosh’s default BM25F, do not work with term weights less than 1, so you should generally not use a boost factor less than 1.
See also IndexWriter.update_document().
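The way _boost and _<fieldname>_boost combine in the last example can be sketched as a small helper. effective_boosts is a hypothetical function written for this illustration, not a Whoosh API. It assumes, as the example implies, that a per-field boost replaces the document-wide _boost for that field rather than multiplying it.

```python
def effective_boosts(**fields):
    """Return {fieldname: boost} for add_document-style keyword arguments.

    Hypothetical helper, not part of Whoosh. Assumes a per-field
    ``_<fieldname>_boost`` overrides the document-wide ``_boost``.
    """
    doc_boost = fields.get("_boost", 1.0)
    boosts = {}
    for name in fields:
        if name.startswith("_"):
            continue  # skip _boost, _<fieldname>_boost and _stored_<fieldname>
        boosts[name] = fields.get("_%s_boost" % name, doc_boost)
    return boosts


boosts = effective_boosts(a="alfa", b="bravo", c="charlie",
                          _boost=2.0, _c_boost=3.0)
print(boosts)
```

Under that assumption, fields “a” and “b” end up with 2.0 and field “c” with 3.0, matching the description above.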
- add_field(fieldname, fieldtype, **kwargs)
Adds a field to the index’s schema.
- Parameters:
fieldname – the name of the field to add.
fieldtype – an instantiated whoosh.fields.FieldType object.
- delete_by_query(q, searcher=None)
Deletes any documents matching a query object.
- Returns:
the number of documents deleted.
- delete_by_term(fieldname, text, searcher=None)
Deletes any documents containing the term “text” in the “fieldname” field. This is useful when you have an indexed field containing a unique ID (such as “pathname”) for each document.
- Returns:
the number of documents deleted.
- end_group()
Finish indexing a group of hierarchical documents. See start_group().
- group()
Returns a context manager that calls start_group() and end_group() for you, allowing you to use a with statement to group hierarchical documents:
    with myindex.writer() as w:
        with w.group():
            w.add_document(kind="class", name="Accumulator")
            w.add_document(kind="method", name="add")
            w.add_document(kind="method", name="get_result")
            w.add_document(kind="method", name="close")

        with w.group():
            w.add_document(kind="class", name="Calculator")
            w.add_document(kind="method", name="add")
            w.add_document(kind="method", name="multiply")
            w.add_document(kind="method", name="get_result")
            w.add_document(kind="method", name="close")
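One way to picture what group() does is a context manager that brackets the with block with start_group() and end_group(). The ToyGroupWriter class below is a hypothetical recorder that only logs the order of calls; it is not Whoosh code, and the try/finally cleanup is an assumption about how such a wrapper might behave.

```python
from contextlib import contextmanager


class ToyGroupWriter:
    """Hypothetical recorder; it only logs the order of calls."""

    def __init__(self):
        self.log = []

    def start_group(self):
        self.log.append("start_group")

    def end_group(self):
        self.log.append("end_group")

    def add_document(self, **fields):
        self.log.append(("add", fields["name"]))

    @contextmanager
    def group(self):
        # Bracket the with-block with start_group()/end_group().
        self.start_group()
        try:
            yield
        finally:
            self.end_group()


w = ToyGroupWriter()
with w.group():
    w.add_document(kind="class", name="Accumulator")
    w.add_document(kind="method", name="add")
```

The log then shows start_group, the two additions, and end_group, in that order.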
- remove_field(fieldname, **kwargs)
Removes the named field from the index’s schema. Depending on the backend implementation, this may or may not actually remove existing data for the field from the index. Optimizing the index should always clear out existing data for a removed field.
- start_group()
Start indexing a group of hierarchical documents. The backend should ensure that these documents are all added to the same segment:
    with myindex.writer() as w:
        w.start_group()
        w.add_document(kind="class", name="Accumulator")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="get_result")
        w.add_document(kind="method", name="close")
        w.end_group()

        w.start_group()
        w.add_document(kind="class", name="Calculator")
        w.add_document(kind="method", name="add")
        w.add_document(kind="method", name="multiply")
        w.add_document(kind="method", name="get_result")
        w.add_document(kind="method", name="close")
        w.end_group()
A more convenient way to group documents is to use the group() method and the with statement.
- update_document(**fields)
The keyword arguments map field names to the values to index/store.
This method adds a new document to the index, and automatically deletes any documents with the same values in any fields marked “unique” in the schema:
    from whoosh import fields, index

    schema = fields.Schema(path=fields.ID(unique=True, stored=True), content=fields.TEXT)
    myindex = index.create_in("index", schema)

    w = myindex.writer()
    w.add_document(path=u"/", content=u"Mary had a lamb")
    w.commit()

    w = myindex.writer()
    w.update_document(path=u"/", content=u"Mary had a little lamb")
    w.commit()

    assert myindex.doc_count() == 1
It is safe to use update_document in place of add_document; if there is no existing document to replace, it simply does an add.
You cannot currently pass a list or tuple of values to a “unique” field.
Because this method has to search for documents with the same unique fields and delete them before adding the new document, it is slower than using add_document.
Marking more fields “unique” in the schema will make each update_document call slightly slower.
When you are updating multiple documents, it is faster to batch delete all changed documents and then use add_document to add the replacements instead of using update_document.
Note that this method will only replace a committed document; currently it cannot replace documents you’ve added to the IndexWriter but haven’t yet committed. For example, if you do this:
    >>> writer.update_document(unique_id=u"1", content=u"Replace me")
    >>> writer.update_document(unique_id=u"1", content=u"Replacement")
…this will add two documents with the same value of unique_id, instead of the second document replacing the first.
See IndexWriter.add_document() for information on the _stored_<fieldname>, _<fieldname>_boost, and _boost keyword arguments.
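The committed-only limitation described above can be reproduced with a toy model in plain Python. ToyIndex is a hypothetical class made up for this sketch. It keeps committed and pending documents in two lists and, by assumption, lets update_document search only the committed list; it says nothing about how Whoosh stores documents internally.

```python
class ToyIndex:
    """Hypothetical in-memory model of committed vs. pending documents."""

    def __init__(self, unique_field):
        self.unique_field = unique_field
        self.committed = []  # documents visible after commit()
        self.pending = []    # documents added but not yet committed

    def add_document(self, **fields):
        self.pending.append(fields)

    def update_document(self, **fields):
        # Delete matches among committed documents only; pending ones
        # are not searched, which is the limitation described above.
        key = fields[self.unique_field]
        self.committed = [d for d in self.committed
                          if d[self.unique_field] != key]
        self.pending.append(fields)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def doc_count(self):
        return len(self.committed)


ix = ToyIndex(unique_field="unique_id")
ix.update_document(unique_id="1", content="Replace me")
ix.update_document(unique_id="1", content="Replacement")
ix.commit()
duplicates = ix.doc_count()  # both documents survive: 2

ix.update_document(unique_id="1", content="Final")
ix.commit()
after_update = ix.doc_count()  # committed duplicates were replaced: 1
```

In this model, the two updates issued before the first commit both survive, while the update issued after the commit replaces them.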
Utility writers
- class whoosh.writing.BufferedWriter(index, period=60, limit=10, writerargs=None, commitargs=None)
Convenience class that acts like a writer but buffers added documents before dumping the buffered documents as a batch into the actual index.
In scenarios where you are continuously adding single documents very rapidly (for example a web application where lots of users are adding content simultaneously), using a BufferedWriter is much faster than opening and committing a writer for each document you add. If you’re adding batches of documents at a time, you can just use a regular writer.
(This class may also be useful for batches of update_document calls. In a normal writer, update_document calls cannot update documents you’ve added in that writer. With BufferedWriter, this will work.)
To use this class, create it from your index and keep it open, sharing it between threads.
    >>> from whoosh.writing import BufferedWriter
    >>> writer = BufferedWriter(myindex, period=120, limit=20)
    >>> # Then you can use the writer to add and update documents
    >>> writer.add_document(...)
    >>> writer.add_document(...)
    >>> writer.add_document(...)
    >>> # Before the writer goes out of scope, call close() on it
    >>> writer.close()
Note
This object stores documents in memory and may keep an underlying writer open, so you must explicitly call the close() method on this object before it goes out of scope to release the write lock and make sure any uncommitted changes are saved.
You can read/search the combination of the on-disk index and the buffered documents in memory by calling BufferedWriter.reader() or BufferedWriter.searcher(). This allows quasi-real-time search, where documents are available for searching as soon as they are buffered in memory, before they are committed to disk.
Tip
By using a searcher from the shared writer, multiple threads can search the buffered documents. Of course, other processes will only see the documents that have been written to disk. If you want indexed documents to become available to other processes as soon as possible, you have to use a traditional writer instead of a BufferedWriter.
You can control how often the BufferedWriter flushes the in-memory index to disk using the period and limit arguments. period is the maximum number of seconds between commits. limit is the maximum number of additions to buffer between commits.
You don’t need to call commit() on the BufferedWriter manually. Doing so will just flush the buffered documents to disk early. You can continue to make changes after calling commit(), and you can call commit() multiple times.
- Parameters:
index – the whoosh.index.Index to write to.
period – the maximum amount of time (in seconds) between commits. Set this to 0 or None to not use a timer. Do not set this any lower than a few seconds.
limit – the maximum number of documents to buffer before committing.
writerargs – dictionary specifying keyword arguments to be passed to the index’s writer() method when creating a writer.
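How period and limit interact can be sketched with a toy buffer. ToyBuffer, its clock argument, and the flushed_batches list are hypothetical names for this illustration only. As a simplification, the sketch checks the elapsed time when a document is added instead of running a timer, so it models the flush rule rather than the real class.

```python
import time


class ToyBuffer:
    """Hypothetical model of the period/limit flush rule, not Whoosh code."""

    def __init__(self, period=60, limit=10, clock=time.monotonic):
        self.period = period
        self.limit = limit
        self.clock = clock          # injectable so the rule is testable
        self.buffered = []
        self.flushed_batches = []   # stands in for commits to disk
        self.last_flush = clock()

    def add_document(self, **fields):
        self.buffered.append(fields)
        too_many = len(self.buffered) >= self.limit
        # period of 0 or None disables the time-based rule
        too_old = bool(self.period) and (
            self.clock() - self.last_flush >= self.period)
        if too_many or too_old:
            self.commit()

    def commit(self):
        # Flush whatever is buffered and restart the period.
        if self.buffered:
            self.flushed_batches.append(self.buffered)
            self.buffered = []
        self.last_flush = self.clock()


now = [0.0]                      # fake clock, advanced by hand
buf = ToyBuffer(period=120, limit=3, clock=lambda: now[0])
buf.add_document(n=1)
buf.add_document(n=2)
buf.add_document(n=3)            # third document reaches limit, so flush
now[0] = 200.0
buf.add_document(n=4)            # 200 s since last flush exceeds period, so flush
```

The first batch is flushed because the limit is reached, the second because the period has elapsed. Calling commit() by hand at any point would simply flush early.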
- add_document(**fields)
The keyword arguments map field names to the values to index/store:
    w = myindex.writer()
    w.add_document(path=u"/a", title=u"First doc", text=u"Hello")
    w.commit()
Depending on the field type, some fields may take objects other than unicode strings. For example, NUMERIC fields take numbers, and DATETIME fields take datetime.datetime objects:
    from datetime import datetime
    from whoosh import index
    from whoosh.fields import Schema, DATETIME, NUMERIC, TEXT

    schema = Schema(date=DATETIME, size=NUMERIC(float), content=TEXT)
    myindex = index.create_in("indexdir", schema)

    w = myindex.writer()
    w.add_document(date=datetime.now(), size=5.5, content=u"Hello")
    w.commit()
Instead of a single object (i.e., unicode string, number, or datetime), you can supply a list or tuple of objects. For unicode strings, this bypasses the field’s analyzer. For numbers and dates, this lets you add multiple values for the given field:
    date1 = datetime.now()
    date2 = datetime(2005, 12, 25)
    date3 = datetime(1999, 1, 1)
    w.add_document(date=[date1, date2, date3], size=[9.5, 10],
                   content=[u"alfa", u"bravo", u"charlie"])
For fields that are both indexed and stored, you can specify an alternate value to store using a keyword argument in the form “_stored_<fieldname>”. For example, if you have a field named “title” and you want to index the text “a b c” but store the text “e f g”, use keyword arguments like this:
    writer.add_document(title=u"a b c", _stored_title=u"e f g")
You can boost the weight of all terms in a certain field by specifying a _<fieldname>_boost keyword argument. For example, if you have a field named “content”, you can double the weight of this document for searches in the “content” field like this:
    writer.add_document(content="a b c", _content_boost=2.0)
You can boost every field at once using the _boost keyword. For example, to boost fields “a” and “b” by 2.0, and field “c” by 3.0:
    writer.add_document(a="alfa", b="bravo", c="charlie", _boost=2.0, _c_boost=3.0)
Note that some scoring algorithms, including Whoosh’s default BM25F, do not work with term weights less than 1, so you should generally not use a boost factor less than 1.
See also IndexWriter.update_document().
- update_document(**fields)
The keyword arguments map field names to the values to index/store.
This method adds a new document to the index, and automatically deletes any documents with the same values in any fields marked “unique” in the schema:
    from whoosh import fields, index

    schema = fields.Schema(path=fields.ID(unique=True, stored=True), content=fields.TEXT)
    myindex = index.create_in("index", schema)

    w = myindex.writer()
    w.add_document(path=u"/", content=u"Mary had a lamb")
    w.commit()

    w = myindex.writer()
    w.update_document(path=u"/", content=u"Mary had a little lamb")
    w.commit()

    assert myindex.doc_count() == 1
It is safe to use update_document in place of add_document; if there is no existing document to replace, it simply does an add.
You cannot currently pass a list or tuple of values to a “unique” field.
Because this method has to search for documents with the same unique fields and delete them before adding the new document, it is slower than using add_document.
Marking more fields “unique” in the schema will make each update_document call slightly slower.
When you are updating multiple documents, it is faster to batch delete all changed documents and then use add_document to add the replacements instead of using update_document.
Note that this method will only replace a committed document; currently it cannot replace documents you’ve added to the IndexWriter but haven’t yet committed. For example, if you do this:
    >>> writer.update_document(unique_id=u"1", content=u"Replace me")
    >>> writer.update_document(unique_id=u"1", content=u"Replacement")
…this will add two documents with the same value of unique_id, instead of the second document replacing the first.
See IndexWriter.add_document() for information on the _stored_<fieldname>, _<fieldname>_boost, and _boost keyword arguments.
- class whoosh.writing.AsyncWriter(index, delay=0.25, writerargs=None)
Convenience wrapper for a writer object that might fail due to locking (i.e. the filedb writer). This object will attempt once to obtain the underlying writer, and if it’s successful, will simply pass method calls on to it.
If this object can’t obtain a writer immediately, it will buffer delete, add, and update method calls in memory until you call commit(). At that point, this object will start running in a separate thread, trying to obtain the writer over and over, and once it obtains it, “replay” all the buffered method calls on it.
In a typical scenario where you’re adding a single or a few documents to the index as the result of a Web transaction, this lets you just create the writer, add, and commit, without having to worry about index locks, retries, etc.
For example, to get an asynchronous writer, instead of this:
    >>> writer = myindex.writer()
Do this:
    >>> from whoosh.writing import AsyncWriter
    >>> writer = AsyncWriter(myindex)
- Parameters:
index – the whoosh.index.Index to write to.
delay – the delay (in seconds) between attempts to instantiate the actual writer.
writerargs – an optional dictionary specifying keyword arguments to be passed to the index’s writer() method.
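The buffer-and-replay idea described above can be sketched in plain Python. Everything in this block (LockError, ToyTarget, ToyAsync, flaky_get_writer, max_attempts) is hypothetical and exists only for illustration. Unlike the real class, the sketch always buffers, retries synchronously instead of in a separate thread, and does not sleep between attempts.

```python
class LockError(Exception):
    """Hypothetical error raised while the toy index is locked."""


class ToyTarget:
    """Stands in for the real writer; records what it receives."""

    def __init__(self):
        self.received = []

    def add_document(self, **fields):
        self.received.append(("add_document", fields))

    def delete_by_term(self, fieldname, text):
        self.received.append(("delete_by_term", (fieldname, text)))


class ToyAsync:
    """Buffers method calls, then replays them once a writer is available."""

    def __init__(self, get_writer):
        self.get_writer = get_writer   # callable that may raise LockError
        self.events = []               # (method name, args, kwargs)

    def add_document(self, **fields):
        self.events.append(("add_document", (), fields))

    def delete_by_term(self, fieldname, text):
        self.events.append(("delete_by_term", (fieldname, text), {}))

    def commit(self, max_attempts=5):
        # Retry until the writer can be obtained, then replay in order.
        for _ in range(max_attempts):
            try:
                writer = self.get_writer()
            except LockError:
                continue  # the real class would wait `delay` seconds here
            for name, args, kwargs in self.events:
                getattr(writer, name)(*args, **kwargs)
            self.events = []
            return writer
        raise LockError("could not obtain the writer")


attempts = []
target = ToyTarget()


def flaky_get_writer():
    # Fails twice to simulate a locked index, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise LockError("index is locked")
    return target


w = ToyAsync(flaky_get_writer)
w.delete_by_term("path", "/a")
w.add_document(path="/a", content="Hello")
w.commit()
```

The buffered calls reach the target in the order they were made, once the third attempt to obtain the writer succeeds.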
- add_document(*args, **kwargs)
The keyword arguments map field names to the values to index/store:
    w = myindex.writer()
    w.add_document(path=u"/a", title=u"First doc", text=u"Hello")
    w.commit()
Depending on the field type, some fields may take objects other than unicode strings. For example, NUMERIC fields take numbers, and DATETIME fields take datetime.datetime objects:
    from datetime import datetime
    from whoosh import index
    from whoosh.fields import Schema, DATETIME, NUMERIC, TEXT

    schema = Schema(date=DATETIME, size=NUMERIC(float), content=TEXT)
    myindex = index.create_in("indexdir", schema)

    w = myindex.writer()
    w.add_document(date=datetime.now(), size=5.5, content=u"Hello")
    w.commit()
Instead of a single object (i.e., unicode string, number, or datetime), you can supply a list or tuple of objects. For unicode strings, this bypasses the field’s analyzer. For numbers and dates, this lets you add multiple values for the given field:
    date1 = datetime.now()
    date2 = datetime(2005, 12, 25)
    date3 = datetime(1999, 1, 1)
    w.add_document(date=[date1, date2, date3], size=[9.5, 10],
                   content=[u"alfa", u"bravo", u"charlie"])
For fields that are both indexed and stored, you can specify an alternate value to store using a keyword argument in the form “_stored_<fieldname>”. For example, if you have a field named “title” and you want to index the text “a b c” but store the text “e f g”, use keyword arguments like this:
    writer.add_document(title=u"a b c", _stored_title=u"e f g")
You can boost the weight of all terms in a certain field by specifying a _<fieldname>_boost keyword argument. For example, if you have a field named “content”, you can double the weight of this document for searches in the “content” field like this:
    writer.add_document(content="a b c", _content_boost=2.0)
You can boost every field at once using the _boost keyword. For example, to boost fields “a” and “b” by 2.0, and field “c” by 3.0:
    writer.add_document(a="alfa", b="bravo", c="charlie", _boost=2.0, _c_boost=3.0)
Note that some scoring algorithms, including Whoosh’s default BM25F, do not work with term weights less than 1, so you should generally not use a boost factor less than 1.
See also IndexWriter.update_document().
- add_field(*args, **kwargs)
Adds a field to the index’s schema.
- Parameters:
fieldname – the name of the field to add.
fieldtype – an instantiated whoosh.fields.FieldType object.
- cancel(*args, **kwargs)
Cancels any documents/deletions added by this object and unlocks the index.
- delete_by_term(*args, **kwargs)
Deletes any documents containing the term “text” in the “fieldname” field. This is useful when you have an indexed field containing a unique ID (such as “pathname”) for each document.
- Returns:
the number of documents deleted.
- remove_field(*args, **kwargs)
Removes the named field from the index’s schema. Depending on the backend implementation, this may or may not actually remove existing data for the field from the index. Optimizing the index should always clear out existing data for a removed field.
- run()
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
- update_document(*args, **kwargs)
The keyword arguments map field names to the values to index/store.
This method adds a new document to the index, and automatically deletes any documents with the same values in any fields marked “unique” in the schema:
    from whoosh import fields, index

    schema = fields.Schema(path=fields.ID(unique=True, stored=True), content=fields.TEXT)
    myindex = index.create_in("index", schema)

    w = myindex.writer()
    w.add_document(path=u"/", content=u"Mary had a lamb")
    w.commit()

    w = myindex.writer()
    w.update_document(path=u"/", content=u"Mary had a little lamb")
    w.commit()

    assert myindex.doc_count() == 1
It is safe to use update_document in place of add_document; if there is no existing document to replace, it simply does an add.
You cannot currently pass a list or tuple of values to a “unique” field.
Because this method has to search for documents with the same unique fields and delete them before adding the new document, it is slower than using add_document.
Marking more fields “unique” in the schema will make each update_document call slightly slower.
When you are updating multiple documents, it is faster to batch delete all changed documents and then use add_document to add the replacements instead of using update_document.
Note that this method will only replace a committed document; currently it cannot replace documents you’ve added to the IndexWriter but haven’t yet committed. For example, if you do this:
    >>> writer.update_document(unique_id=u"1", content=u"Replace me")
    >>> writer.update_document(unique_id=u"1", content=u"Replacement")
…this will add two documents with the same value of unique_id, instead of the second document replacing the first.
See IndexWriter.add_document() for information on the _stored_<fieldname>, _<fieldname>_boost, and _boost keyword arguments.