gensim logo

gensim
gensim tagline

Get Expert Help

• machine learning, NLP, data mining

• custom SW design, development, optimizations

• tech trainings & IT consulting

models – Package for transformation models

models – Package for transformation models

This package contains algorithms for extracting document representations from their raw bag-of-word counts.

class gensim.models.VocabTransform(old2new, id2token=None)

Remap feature ids to new values.

Given a mapping between old ids and new ids (some old ids may be missing = these features are to be discarded), this will wrap a corpus so that iterating over VocabTransform[corpus] returns the same vectors but with the new ids.

Old features that have no counterpart in the new ids are discarded. This can be used to filter vocabulary of a corpus “online”:

>>> old2new = dict((oldid, newid) for newid, oldid in enumerate(ids_you_want_to_keep))
>>> vt = VocabTransform(old2new)
>>> for vec_with_new_ids in vt[corpus_with_old_ids]:
>>>     ...
classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

save(fname, separately=None, sep_limit=10485760, ignore=frozenset([]))

Save the object to file (also see load).

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.