lang.wordnet module
This module contains low-level functions and a high-level class for parsing the Prolog file “wn_s.pl” from the WordNet Prolog download into an object suitable for looking up synonyms and performing query expansion.
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
Thesaurus
- class whoosh.lang.wordnet.Thesaurus
Represents the WordNet synonym database, either loaded into memory from the wn_s.pl Prolog file, or stored on disk in a Whoosh index.
This class allows you to parse the Prolog file “wn_s.pl” from the WordNet Prolog download into an object suitable for looking up synonyms and performing query expansion.
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
To load a Thesaurus object from the wn_s.pl file…
>>> t = Thesaurus.from_filename("wn_s.pl")
To save the in-memory Thesaurus to a Whoosh index…
>>> from whoosh.filedb.filestore import FileStorage
>>> fs = FileStorage("index")
>>> t.to_storage(fs)
To load a Thesaurus object from a Whoosh index…
>>> t = Thesaurus.from_storage(fs)
The Thesaurus object is thus usable in two ways:
Parse the wn_s.pl file into memory (Thesaurus.from_file or Thesaurus.from_filename) and then look up synonyms in memory. This has a startup cost for parsing the file and uses quite a bit of memory to store two large dictionaries, but synonym look-ups are very fast.
Parse the wn_s.pl file into memory (Thesaurus.from_filename), then save it to an index (to_storage). From then on, open the thesaurus from the saved index (Thesaurus.from_storage). Writing the index is a large one-time cost; after that, opening the Thesaurus is faster than re-parsing the file, but synonym look-ups are slightly slower.
Here are timings for various tasks on my (fast) Windows machine, which might give an idea of relative costs for in-memory vs. on-disk.
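===============================================  ================
Task                                             Approx. time (s)
===============================================  ================
Parsing the wn_s.pl file                         1.045
Saving to an on-disk index                       13.084
Loading from an on-disk index                    0.082
Look up synonyms for “light” (in memory)         0.0011
Look up synonyms for “light” (loaded from disk)  0.0028
===============================================  ================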
Basically, if you can afford the memory needed to parse the Thesaurus and keep it cached, the in-memory approach is faster. Otherwise, use an on-disk index.
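Because the timings above come from a single machine, it can be worth measuring the same steps locally before choosing an approach. The following is a minimal sketch of a timing helper; `timed` is a hypothetical name, not part of Whoosh, and the workload shown is a stand-in so that the example runs without the WordNet data.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload (an assumption for illustration). In practice one would
# pass a callable such as Thesaurus.from_filename together with the path to
# wn_s.pl, or a bound method such as t.synonyms together with a word.
result, seconds = timed(sorted, ["herald", "come", "acclaim"])
print(result, f"{seconds:.6f}s")
```

Timing each of the steps this way (parsing, saving, loading, and a look-up) gives a local version of the table above.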
- classmethod from_file(fileobj)
Creates a Thesaurus object from the given file-like object, which should contain the WordNet wn_s.pl file.
>>> f = open("wn_s.pl")
>>> t = Thesaurus.from_file(f)
>>> t.synonyms("hail")
['acclaim', 'come', 'herald']
- classmethod from_filename(filename)
Creates a Thesaurus object from the given filename, which should contain the WordNet wn_s.pl file.
>>> t = Thesaurus.from_filename("wn_s.pl")
>>> t.synonyms("hail")
['acclaim', 'come', 'herald']
- classmethod from_storage(storage, indexname='THES')
Creates a Thesaurus object from the given storage object, which should contain an index created by Thesaurus.to_storage().
>>> from whoosh.filedb.filestore import FileStorage
>>> fs = FileStorage("index")
>>> t = Thesaurus.from_storage(fs)
>>> t.synonyms("hail")
['acclaim', 'come', 'herald']
- Parameters:
storage – A whoosh.store.Storage object from which to load the index.
indexname – A name for the index. This allows you to store multiple indexes in the same storage object.
- synonyms(word)
Returns a list of synonyms for the given word.
>>> thesaurus.synonyms("hail")
['acclaim', 'come', 'herald']
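An in-memory look-up of this kind can be pictured as two dictionary hops. The following is a minimal sketch, not Whoosh's actual implementation. It assumes that word2nums maps each word to a list of synset numbers and that num2words maps each synset number to the words in that synset. The function name `lookup_synonyms` and the toy data are hypothetical.

```python
def lookup_synonyms(word2nums, num2words, word):
    """Return the sorted synonyms of `word`, excluding the word itself."""
    found = set()
    # Hop 1: every synset number the word belongs to.
    for num in word2nums.get(word, ()):
        # Hop 2: every word that shares that synset.
        found.update(num2words.get(num, ()))
    found.discard(word)
    return sorted(found)


# Hand-built toy data; the synset numbers are invented, not taken from WordNet.
word2nums = {"hail": [1, 2], "acclaim": [1], "herald": [1], "come": [2]}
num2words = {1: ["hail", "acclaim", "herald"], 2: ["hail", "come"]}

print(lookup_synonyms(word2nums, num2words, "hail"))
# prints ['acclaim', 'come', 'herald']
```

An unknown word simply yields an empty list in this sketch, because neither hop finds anything.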
- to_storage(storage, indexname='THES')
Creates an index in the given storage object from the synonyms loaded from a WordNet file.
>>> from whoosh.filedb.filestore import FileStorage
>>> fs = FileStorage("index")
>>> t = Thesaurus.from_filename("wn_s.pl")
>>> t.to_storage(fs)
- Parameters:
storage – A whoosh.store.Storage object in which to save the index.
indexname – A name for the index. This allows you to store multiple indexes in the same storage object.
Low-level functions
- whoosh.lang.wordnet.parse_file(f)
Parses the WordNet wn_s.pl prolog file and returns two dictionaries: word2nums and num2words.
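To make the shape of the two dictionaries concrete, here is a minimal sketch of such a parser. It is not Whoosh's own parse_file, which may filter or normalize words differently. It assumes each fact in wn_s.pl has the form s(synset_id,w_num,'word',...). and that word2nums and num2words map in the directions their names suggest. The function name `parse_synsets` and the sample facts are invented for illustration.

```python
import io
import re
from collections import defaultdict

# Assumed shape of a fact in wn_s.pl: s(synset_id,w_num,'word',...).
# A quote inside a word is doubled ('') in Prolog, hence the alternation.
_FACT = re.compile(r"^s\((\d+),\d+,'((?:[^']|'')*)'")


def parse_synsets(lines):
    """Build (word2nums, num2words) from an iterable of Prolog fact lines."""
    word2nums = defaultdict(list)
    num2words = defaultdict(list)
    for line in lines:
        match = _FACT.match(line)
        if match is None:
            continue  # skip anything that is not an s(...) fact
        num = int(match.group(1))
        # Undo Prolog quote doubling and lowercase (a choice of this sketch).
        word = match.group(2).replace("''", "'").lower()
        word2nums[word].append(num)
        num2words[num].append(word)
    return word2nums, num2words


# Invented sample facts; the synset numbers are not real WordNet identifiers.
sample = io.StringIO(
    "s(900000001,1,'hail',v,1,0).\n"
    "s(900000001,2,'acclaim',v,1,0).\n"
    "s(900000002,1,'hail',v,2,0).\n"
    "s(900000002,2,'come',v,1,0).\n"
)
word2nums, num2words = parse_synsets(sample)
print(dict(word2nums))
print(dict(num2words))
```

Keeping both directions of the mapping is what lets a look-up go from a word to its synsets and back to the other words in those synsets.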