Objects of this class realize the transformation of an integer word-document co-occurrence matrix into a locally/globally weighted matrix of positive floats.
This is done by log entropy normalization, optionally followed by normalizing the resulting document vectors to unit length. The following formulas show how the log entropy weight for term i in document j is computed:
local\_weight_{i,j} = \log(frequency_{i,j} + 1)

P_{i,j} = \frac{frequency_{i,j}}{\sum_j frequency_{i,j}}

global\_weight_i = 1 + \frac{\sum_j P_{i,j} \log(P_{i,j})}{\log(number\_of\_documents + 1)}

final\_weight_{i,j} = local\_weight_{i,j} * global\_weight_i
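As a concrete illustration of these formulas (a minimal NumPy sketch, not gensim's actual implementation), here is how the weights can be computed for a toy term-document count matrix:

import numpy as np

# Toy term-document count matrix: rows = terms i, columns = documents j.
counts = np.array([[1.0, 0.0, 3.0],
                   [2.0, 2.0, 0.0]])
n_docs = counts.shape[1]

# local_weight_{i,j} = log(frequency_{i,j} + 1)
local_weight = np.log(counts + 1)

# P_{i,j} = frequency_{i,j} / sum_j frequency_{i,j}
P = counts / counts.sum(axis=1, keepdims=True)

# Entropy term, with the usual convention 0 * log(0) = 0.
with np.errstate(divide='ignore', invalid='ignore'):
    plogp = np.where(P > 0, P * np.log(P), 0.0)

# global_weight_i = 1 + sum_j P_{i,j} log(P_{i,j}) / log(n_docs + 1)
global_weight = 1 + plogp.sum(axis=1) / np.log(n_docs + 1)

# final_weight_{i,j} = local_weight_{i,j} * global_weight_i
final_weight = local_weight * global_weight[:, np.newaxis]
print(final_weight)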
The main methods are:

1. the constructor, which calculates the global weighting for all terms in a corpus;
2. the [] method, which transforms a simple count representation into the log entropy normalized space.
>>> from gensim.models import LogEntropyModel
>>> log_ent = LogEntropyModel(corpus)  # corpus: any iterable of bag-of-words vectors
>>> print(log_ent[some_doc])
>>> log_ent.save('/tmp/foo.log_ent_model')
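For a self-contained version of the example above, a bag-of-words corpus can be built with gensim's Dictionary (the texts here are illustrative):

from gensim.corpora import Dictionary
from gensim.models import LogEntropyModel

# Illustrative tokenized documents.
texts = [["human", "computer", "interface"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "graph", "minors"]]

dct = Dictionary(texts)                         # map tokens to integer ids
corpus = [dct.doc2bow(text) for text in texts]  # bag-of-words count vectors

log_ent = LogEntropyModel(corpus)  # compute global weights over the corpus
print(log_ent[corpus[0]])          # transform one document into the weighted space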
Model persistence is achieved via its load() and save() methods.
normalize dictates whether the resulting vectors will be normalized to unit length.
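For instance (reusing the corpus from the example above; normalize defaults to True in gensim):

>>> log_ent_raw = LogEntropyModel(corpus, normalize=False)   # raw log-entropy weights
>>> log_ent_unit = LogEntropyModel(corpus, normalize=True)   # unit-length output vectors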
Initialize internal statistics based on a training corpus. Called automatically from the constructor.
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap; load large arrays as normal objects.
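For example, to reload the model saved above, optionally memory-mapping its large arrays:

>>> log_ent = LogEntropyModel.load('/tmp/foo.log_ent_model')
>>> log_ent_mmap = LogEntropyModel.load('/tmp/foo.log_ent_model', mmap='r')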
Save the object to file (also see load).
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names not to serialize (file handles, caches, etc.). On subsequent load(), these attributes will be set to None.
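A sketch of these options used together (the attribute names passed to separately and ignore are purely illustrative, not actual LogEntropyModel attributes):

>>> log_ent.save('/tmp/foo.log_ent_model',
...              separately=['big_array'],  # hypothetical large array attribute
...              ignore=['cache'])          # hypothetical unpicklable attribute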