corpora.mmcorpus – Corpus in Matrix Market format

Corpus in the Matrix Market format.

class gensim.corpora.mmcorpus.MmCorpus(fname)

Corpus in the Matrix Market format.

docbyoffset(offset)

Return the document at file offset offset (in bytes).

classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.

static save_corpus(fname, corpus, id2word=None, progress_cnt=1000, metadata=False)

Save a corpus in the Matrix Market format to disk.

This function is automatically called by MmCorpus.serialize; don’t call it directly, call serialize instead.
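For orientation, the following is a minimal sketch of what a Matrix Market coordinate file could look like for a bag-of-words corpus. It is not gensim's implementation: write_mm is a hypothetical helper, and the layout assumes that rows are documents, columns are term ids, and both indices are 1-based.

```python
import io

def write_mm(corpus, out):
    """Write bag-of-words documents in Matrix Market coordinate format (illustrative only)."""
    entries = []
    num_docs = 0
    num_terms = 0
    for doc_no, doc in enumerate(corpus, start=1):  # Matrix Market indices are 1-based
        num_docs = doc_no
        for term_id, weight in sorted(doc):
            entries.append((doc_no, term_id + 1, weight))
            num_terms = max(num_terms, term_id + 1)
    # Header line, then "rows columns non-zeros", then one line per non-zero entry.
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write(f"{num_docs} {num_terms} {len(entries)}\n")
    for doc_no, term_no, weight in entries:
        out.write(f"{doc_no} {term_no} {weight}\n")

# Two toy documents as lists of (term_id, weight) pairs.
corpus = [[(0, 1.0), (2, 2.0)], [(1, 0.5)]]
buf = io.StringIO()
write_mm(corpus, buf)
mm_text = buf.getvalue()
print(mm_text)
```

The sketch buffers all entries in memory so that the non-zero count can be written up front; a streaming writer would have to fill in that line after the documents are written.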

classmethod serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).

This relies on the underlying corpus class serializer providing (in addition to standard iteration):

  • save_corpus method that returns a sequence of byte offsets, one for each saved document,

  • the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
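The two requirements above can be illustrated with a small standard-library sketch. ToyCorpus and its one-document-per-line file format are hypothetical and are not part of gensim; the point is only to show how the byte offsets returned by save_corpus make later random access through docbyoffset possible.

```python
import os
import tempfile

class ToyCorpus:
    """Hypothetical one-document-per-line format; not a gensim class."""

    def __init__(self, fname):
        self.fname = fname

    @staticmethod
    def save_corpus(fname, corpus):
        """Write each document on its own line and return the byte offset of each."""
        offsets = []
        with open(fname, "wb") as fout:
            for doc in corpus:
                offsets.append(fout.tell())  # offset at which this document starts
                line = " ".join(f"{term_id}:{weight}" for term_id, weight in doc)
                fout.write(line.encode("utf8") + b"\n")
        return offsets

    def docbyoffset(self, offset):
        """Seek to `offset` and parse the single document stored there."""
        with open(self.fname, "rb") as fin:
            fin.seek(offset)
            fields = fin.readline().decode("utf8").split()
        return [(int(t), float(w)) for t, w in (f.split(":") for f in fields)]

corpus = [[(0, 1.0), (2, 2.0)], [(1, 0.5)], [(0, 3.0)]]
fname = os.path.join(tempfile.mkdtemp(), "toy.txt")
index = ToyCorpus.save_corpus(fname, corpus)  # plays the role of the saved index
second_doc = ToyCorpus(fname).docbyoffset(index[1])
print(index, second_doc)
```

With the offsets stored, fetching document number n costs one seek and one read, regardless of how many documents precede it in the file.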

Example:

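>>> from gensim.corpora import MmCorpus
>>> # `corpus` is assumed to be an existing iterable of bag-of-words documents
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.

skip_headers(input_file)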

Skip file headers that appear before the first document.
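A rough sketch of what header skipping could look like for a Matrix Market file opened in binary mode. skip_mm_headers is a hypothetical stand-in, not the library method, and it assumes that the headers consist of the lines starting with % plus the dimensions line that follows them.

```python
import io

def skip_mm_headers(input_file):
    """Advance `input_file` past the '%' lines and the dimensions line (assumed layout)."""
    for line in input_file:
        if line.startswith(b"%"):
            continue  # banner or comment line
        break  # first non-comment line: "rows columns non-zeros", consumed as well

data = io.BytesIO(
    b"%%MatrixMarket matrix coordinate real general\n"
    b"% an optional comment\n"
    b"2 3 3\n"
    b"1 1 1.0\n"
    b"1 3 2.0\n"
    b"2 2 0.5\n"
)
skip_mm_headers(data)
first_entry = data.readline()  # the stream is now positioned at the first entry
print(first_entry)
```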