Model

class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False)

Biterm Topic Model.

Parameters:
  • n_dw (csr.csr_matrix) – Documents vs words frequency matrix (documents in rows, words in columns). Typically, this is the output of CountVectorizer from the scikit-learn package.

  • vocabulary (list) – Vocabulary (a list of words).

  • T (int) – Number of topics.

  • M (int = 20) – Number of top words for coherence calculation.

  • alpha (float = 1) – Model parameter (symmetric Dirichlet prior on the topic distribution).

  • beta (float = 0.01) – Model parameter (symmetric Dirichlet prior on the per-topic word distributions).

  • seed (int = 0) – Random state seed. If seed is 0 (default), the generator is seeded with time(NULL), so results differ between runs; pass a nonzero value for repeatable results.

  • win (int = 15) – Biterms generation window.

  • has_background (bool = False) – Use a background topic to accumulate highly frequent words.
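The following is a minimal sketch of preparing the constructor inputs and creating a model. It assumes the helper functions bitermplus.get_words_freqs(), bitermplus.get_vectorized_docs(), and bitermplus.get_biterms() from the same package. The wrapper name build_btm and the hyperparameter values (for example alpha = 50 / T) are illustrative choices, not recommendations from this reference.

```python
def build_btm(texts, n_topics, seed=12321):
    """Build (but do not fit) a BTM from raw text documents.

    `build_btm` is a hypothetical helper, not part of bitermplus.
    """
    # Imported lazily so this helper can be defined without the package installed.
    import bitermplus as btm

    # Sparse documents vs words frequency matrix and the corpus vocabulary.
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
    # Each document becomes a list of word ids.
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Biterms generated from the vectorized documents.
    biterms = btm.get_biterms(docs_vec)

    # Illustrative hyperparameters; tune them for your own corpus.
    model = btm.BTM(
        X, vocabulary, T=n_topics, M=20,
        alpha=50 / n_topics, beta=0.01, seed=seed,
    )
    return model, docs_vec, biterms
```

The returned docs_vec and biterms are the inputs expected by fit(), transform(), and fit_transform().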

alpha_

Model parameter.

Type:

float

beta_

Model parameter.

Type:

float

biterms_

Model biterms. Terms are encoded with their corresponding ids.

Type:

np.ndarray
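To illustrate what a biterm is, here is a pure-Python sketch of windowed biterm generation: every unordered pair of word ids that co-occur within win positions of each other in a document. The function name window_biterms is hypothetical, and the exact window boundary handling is an assumption that may differ from the library's own implementation.

```python
def window_biterms(doc, win=15):
    """Return unordered word-id pairs that co-occur within `win` positions."""
    biterms = []
    for i in range(len(doc) - 1):
        # Pair word i with the words that follow it inside the window.
        for j in range(i + 1, min(i + win, len(doc))):
            a, b = doc[i], doc[j]
            # Store the smaller id first so (a, b) and (b, a) are the same biterm.
            biterms.append((min(a, b), max(a, b)))
    return biterms


print(window_biterms([3, 1, 2]))         # [(1, 3), (2, 3), (1, 2)]
print(window_biterms([3, 1, 2], win=2))  # [(1, 3), (1, 2)]
```

A smaller window keeps only pairs of nearby words, which matters mostly for longer documents.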

coherence_

Semantic topics coherence.

Type:

np.ndarray

coherence_window_

Number of top words for coherence calculation.

Type:

int

df_words_topics_

Words vs topics probabilities in a DataFrame.

Type:

DataFrame

fit(self, list Bs, int iterations=600, bool verbose=True)

Biterm topic model fitting method.

Parameters:
  • Bs (list) – Biterms list.

  • iterations (int = 600) – Number of iterations.

  • verbose (bool = True) – Show progress bar.

fit_transform(self, docs, list biterms, unicode infer_type=u'sum_b', int iterations=600, bool verbose=True)

Run model fitting and return documents vs topics matrix.

Parameters:
  • docs (list) – Documents list. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().

  • biterms (list) – List of biterms.

  • infer_type (str) –

    Inference type. The following options are available:

    1. sum_b (default).

    2. sum_w.

    3. mix.

  • iterations (int = 600) – Number of iterations.

  • verbose (bool = True) – Be verbose (show progress bars).

Returns:

p_zd – Documents vs topics matrix (D x T).

Return type:

np.ndarray
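The sketch below spells out the two-step sequence that fit_transform() combines: fitting on the biterms, then inferring the documents vs topics matrix. It assumes the combined call behaves like fit() followed by transform(), which the signatures suggest but which this sketch does not verify. The helper name fit_then_transform is hypothetical.

```python
def fit_then_transform(model, docs, biterms, infer_type="sum_b",
                       iterations=600, verbose=True):
    """Fit the model on biterms, then infer topic probabilities for docs.

    `model` is expected to expose fit() and transform() with the
    signatures documented on this page.
    """
    model.fit(biterms, iterations=iterations, verbose=verbose)
    return model.transform(docs, infer_type=infer_type, verbose=verbose)
```

Calling the two methods separately is useful when the fitted model is later applied to documents it was not trained on.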

has_background_

Specifies whether the model has a background topic to accumulate highly frequent words.

Type:

bool

iterations_

Number of iterations the model fitting process has gone through.

Type:

int

labels_

Model document labels (most probable topic for each document).

Type:

np.ndarray
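As an illustration of "most probable topic for each document", the following pure-Python sketch takes a documents vs topics matrix and returns the index of the largest probability in each row. The helper name top_topic_labels and the toy matrix are hypothetical.

```python
def top_topic_labels(p_zd):
    """Return the index of the most probable topic for each document row."""
    return [max(range(len(row)), key=row.__getitem__) for row in p_zd]


# Toy documents vs topics matrix: 2 documents, 3 topics.
p_zd = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
print(top_topic_labels(p_zd))  # [0, 2]
```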

matrix_docs_topics_

Documents vs topics probabilities matrix.

Type:

np.ndarray

matrix_topics_docs_

Topics vs documents probabilities matrix.

Type:

np.ndarray

matrix_topics_words_

Topics vs words probabilities matrix.

Type:

np.ndarray

matrix_words_topics_

Words vs topics probabilities matrix.

Type:

np.ndarray

perplexity_

Perplexity. Run the transform method before reading this attribute.

Type:

float
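For orientation, here is a pure-Python sketch of the standard perplexity definition for a topic model: the exponential of the negative average log-likelihood per word, with p(w|d) obtained by summing p(w|z) p(z|d) over topics. It is the textbook formula and has not been checked against the library's own implementation. The function and argument names are hypothetical.

```python
import math


def perplexity(p_wz, p_zd, n_dw):
    """Perplexity from dense nested lists.

    p_wz: topics x words probabilities
    p_zd: documents x topics probabilities
    n_dw: documents x words counts
    """
    log_likelihood = 0.0
    total_words = 0
    for d, counts in enumerate(n_dw):
        for w, n in enumerate(counts):
            if n == 0:
                continue  # words absent from the document contribute nothing
            # p(w|d) = sum over topics of p(w|z) * p(z|d)
            p_wd = sum(p_zd[d][z] * p_wz[z][w] for z in range(len(p_wz)))
            log_likelihood += n * math.log(p_wd)
            total_words += n
    return math.exp(-log_likelihood / total_words)
```

As a sanity check, a model that assigns every one of V words the same probability has a perplexity of V.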

theta_

Topics probabilities vector.

Type:

np.ndarray

topics_num_

Number of topics.

Type:

int

transform(self, list docs, unicode infer_type=u'sum_b', bool verbose=True)

Return documents vs topics probability matrix.

Parameters:
  • docs (list) – Documents list. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().

  • infer_type (str) –

    Inference type. The following options are available:

    1. sum_b (default).

    2. sum_w.

    3. mix.

  • verbose (bool = True) – Be verbose (show progress bar).

Returns:

p_zd – Documents vs topics probability matrix (D x T).

Return type:

np.ndarray
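Since infer_type changes how document-level topic probabilities are derived, it can be useful to compare the three options on the same documents. The sketch below assumes an already fitted model exposing transform() with the signature documented above. The helper name transform_with_all_types is hypothetical.

```python
def transform_with_all_types(model, docs, verbose=False):
    """Run transform() once per documented inference type.

    Returns a dict mapping each infer_type to its documents vs topics matrix.
    """
    return {
        infer_type: model.transform(docs, infer_type=infer_type, verbose=verbose)
        for infer_type in ("sum_b", "sum_w", "mix")
    }
```

Each value in the returned dict has the same D x T shape, so the results can be compared row by row.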

vocabulary_

Vocabulary (list of words).

Type:

np.ndarray

vocabulary_size_

Vocabulary size (number of words).

Type:

int

window_

Biterms generation window size.

Type:

int