Model

class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False)

Biterm Topic Model.

Parameters:
  • n_dw (scipy.sparse.csr_matrix) – Documents vs words frequency matrix, typically the output of CountVectorizer from the sklearn package.

  • vocabulary (list) – Vocabulary (a list of words).

  • T (int) – Number of topics.

  • M (int = 20) – Number of top words for coherence calculation.

  • alpha (float = 1) – Model parameter: symmetric Dirichlet prior on the topic distribution.

  • beta (float = 0.01) – Model parameter: symmetric Dirichlet prior on the topic-word distributions.

  • seed (int = 0) – Random state seed. If seed is 0 (default), the current time, time(NULL), is used as the seed.

  • win (int = 15) – Biterms generation window size.

  • has_background (bool = False) – Use a background topic to accumulate highly frequent words.

alpha_

Model parameter alpha (symmetric Dirichlet prior on the topic distribution).

beta_

Model parameter beta (symmetric Dirichlet prior on the topic-word distributions).

biterms_

Model biterms. Terms are coded with the corresponding ids.
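As an illustration of what a biterm list looks like, the following is a minimal pure-Python sketch of window-based biterm generation for one document. The function name generate_biterms is hypothetical, and the exact windowing rule is an assumption based on the original BTM formulation; the library's own implementation may differ in details.

```python
def generate_biterms(doc, win=15):
    """Return unordered word-id pairs that co-occur within `win` positions."""
    biterms = []
    for i in range(len(doc) - 1):
        # Pair position i with the following positions that fall inside the window.
        for j in range(i + 1, min(i + win, len(doc))):
            a, b = doc[i], doc[j]
            # Store the smaller id first so that (a, b) and (b, a) coincide.
            biterms.append((min(a, b), max(a, b)))
    return biterms


doc = [3, 1, 2]  # one document as a list of word ids
print(generate_biterms(doc))         # [(1, 3), (2, 3), (1, 2)]
print(generate_biterms(doc, win=2))  # [(1, 3), (1, 2)]
```

A smaller window drops pairs of words that are far apart, which matters mostly for longer documents.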

coherence_

Semantic coherence of the topics.

coherence_window_

Number of top words for coherence calculation.

df_words_topics_

Words vs topics probabilities in a DataFrame.

fit(self, list Bs, int iterations=600, bool verbose=True)

Biterm topic model fitting method.

Parameters:
  • Bs (list) – Biterms list.

  • iterations (int = 600) – Number of iterations.

  • verbose (bool = True) – Show progress bar.

fit_transform(self, docs, list biterms, unicode infer_type=u'sum_b', int iterations=600, bool verbose=True)

Run model fitting and return documents vs topics matrix.

Parameters:
  • docs (list) – Documents list. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().

  • biterms (list) – List of biterms.

  • infer_type (str) –

    Inference type. The following options are available:

    1. sum_b (default).

    2. sum_w.

    3. mix.

  • iterations (int = 600) – Number of iterations.

  • verbose (bool = True) – Be verbose (show progress bars).

Returns:

p_zd – Documents vs topics matrix (D x T).

Return type:

np.ndarray
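A typical call sequence might look like the sketch below. The wrapper fit_btm is hypothetical. The helpers get_words_freqs and get_biterms belong to the package but are not described in this section, so their use here is an assumption about the installed version. The values alpha=50/T, iterations=20, and the seed are illustrative only.

```python
def fit_btm(texts, n_topics=8, iterations=20):
    """Fit a BTM on raw texts; return the model and the D x T matrix."""
    # Imported inside the function so the dependency is needed only when called.
    import bitermplus as btm

    # Documents vs words counts, the vocabulary, and a word-to-id mapping.
    n_dw, vocabulary, _ = btm.get_words_freqs(texts)
    # Each document as a list of word ids.
    docs = btm.get_vectorized_docs(texts, vocabulary)
    # Biterms extracted from the vectorized documents.
    biterms = btm.get_biterms(docs)

    model = btm.BTM(n_dw, vocabulary, T=n_topics, M=20,
                    alpha=50 / n_topics, beta=0.01, seed=12321)
    # Fit the model and infer documents vs topics probabilities in one step.
    p_zd = model.fit_transform(docs, biterms, infer_type="sum_b",
                               iterations=iterations, verbose=False)
    return model, p_zd
```

Calling fit(biterms) followed by transform(docs) should give the same kind of result when the two steps need to be separated, for example to transform unseen documents later.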

has_background_

Specifies whether the model has a background topic to accumulate highly frequent words.

iterations_

Number of iterations the model fitting process has gone through.

labels_

Model document labels (most probable topic for each document).
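Conceptually, a label is the row-wise argmax of the documents vs topics matrix. A minimal pure-Python sketch follows; the function name most_probable_topics and the toy matrix are hypothetical.

```python
def most_probable_topics(p_zd):
    """Return the index of the highest-probability topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in p_zd]


# Toy documents vs topics matrix (2 documents, 3 topics).
p_zd = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
print(most_probable_topics(p_zd))  # [1, 0]
```

With a NumPy array the same result can be obtained with p_zd.argmax(axis=1).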

matrix_docs_topics_

Documents vs topics probabilities matrix.

matrix_topics_docs_

Topics vs documents probabilities matrix.

matrix_topics_words_

Topics vs words probabilities matrix.
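One common use of a topics vs words matrix is listing the top words of each topic, which is also what the M parameter refers to. The sketch below uses plain lists; the function name top_words and the toy data are hypothetical.

```python
def top_words(p_wz, vocabulary, m=20):
    """Return the m highest-probability words for each topic."""
    result = []
    for topic_row in p_wz:
        # Word ids sorted by decreasing probability within this topic.
        order = sorted(range(len(topic_row)),
                       key=lambda w: topic_row[w], reverse=True)
        result.append([vocabulary[w] for w in order[:m]])
    return result


vocabulary = ["apple", "bank", "cat", "loan"]
# Toy topics vs words matrix (2 topics, 4 words); each row sums to 1.
p_wz = [
    [0.5, 0.1, 0.3, 0.1],
    [0.05, 0.5, 0.05, 0.4],
]
print(top_words(p_wz, vocabulary, m=2))
# [['apple', 'cat'], ['bank', 'loan']]
```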

matrix_words_topics_

Words vs topics probabilities matrix.

perplexity_

Perplexity.

Run the transform method before calculating perplexity.
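Perplexity depends on the documents vs topics matrix, which is why transform has to run first. The sketch below implements the standard definition, exp(-sum(n_dw * log p(w|d)) / sum(n_dw)) with p(w|d) = sum_z p(w|z) p(z|d). Whether the library computes exactly this quantity is an assumption, and the name perplexity_sketch is hypothetical.

```python
import math


def perplexity_sketch(p_wz, p_zd, n_dw):
    """Perplexity from topics-words, documents-topics, and count matrices."""
    log_likelihood = 0.0
    total_words = 0
    for d, counts in enumerate(n_dw):
        for w, n in enumerate(counts):
            if n == 0:
                continue  # words absent from the document contribute nothing
            # p(w|d) = sum over topics of p(w|z) * p(z|d)
            p_wd = sum(p_wz[z][w] * p_zd[d][z] for z in range(len(p_wz)))
            log_likelihood += n * math.log(p_wd)
            total_words += n
    return math.exp(-log_likelihood / total_words)


# Toy data: 2 topics, 4 words, 2 documents. Uniform word distributions
# give a perplexity close to the vocabulary size (4).
p_wz = [[0.25] * 4, [0.25] * 4]
p_zd = [[0.3, 0.7], [0.9, 0.1]]
n_dw = [[2, 0, 1, 0], [0, 3, 0, 1]]
print(perplexity_sketch(p_wz, p_zd, n_dw))
```

Lower values indicate that the model assigns higher probability to the observed words.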

theta_

Topic probabilities vector.

topics_num_

Number of topics.

transform(self, list docs, unicode infer_type=u'sum_b', bool verbose=True)

Return documents vs topics probability matrix.

Parameters:
  • docs (list) – Documents list. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().

  • infer_type (str) –

    Inference type. The following options are available:

    1. sum_b (default).

    2. sum_w.

    3. mix.

  • verbose (bool = True) – Be verbose (show progress bar).

Returns:

p_zd – Documents vs topics probability matrix (D x T).

Return type:

np.ndarray
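To give an idea of what the sum_b option refers to, the sketch below follows the inference rule from the original BTM formulation: P(z|d) = sum_b P(z|b) P(b|d), where P(z|b) is proportional to theta[z] * phi[z][w1] * phi[z][w2] and P(b|d) is the relative frequency of biterm b in the document. The function name infer_sum_b and the toy parameters are hypothetical, and the library's implementation may differ in details.

```python
def infer_sum_b(doc_biterms, theta, phi):
    """Infer topic probabilities for one document from its biterms."""
    n_topics = len(theta)
    p_zd = [0.0] * n_topics
    if not doc_biterms:
        return p_zd  # no biterms, so nothing can be inferred in this sketch
    for w1, w2 in doc_biterms:
        # Unnormalized P(z|b) for every topic z.
        scores = [theta[z] * phi[z][w1] * phi[z][w2] for z in range(n_topics)]
        norm = sum(scores)
        for z in range(n_topics):
            # Each biterm occurrence carries weight 1 / (number of biterms),
            # which equals P(b|d) once repeated biterms are added up.
            p_zd[z] += scores[z] / norm / len(doc_biterms)
    return p_zd


theta = [0.5, 0.5]                # topic probabilities
phi = [[0.8, 0.1, 0.1],           # topic 0: word probabilities
       [0.1, 0.1, 0.8]]           # topic 1: word probabilities
p = infer_sum_b([(0, 1), (0, 0)], theta, phi)
print(p)  # topic 0 dominates because word 0 is much more likely under it
```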

vocabulary_

Vocabulary (list of words).

vocabulary_size_

Vocabulary size (number of words).

window_

Biterms generation window size.