Model

class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False)

Biterm Topic Model.

Parameters:

n_dw (csr.csr_matrix) – Documents vs words frequency matrix. Typically, this is the output of CountVectorizer from the sklearn package.
vocabulary (list) – Vocabulary (a list of words).
T (int) – Number of topics.
M (int = 20) – Number of top words for coherence calculation.
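alpha (float = 1) – Model parameter (symmetric Dirichlet prior on the topic distribution).
beta (float = 0.01) – Model parameter (symmetric Dirichlet prior on the topic-word distributions).
seed (int = 0) – Random state seed. If seed is 0 (default), time(NULL) is used as the seed.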
win (int = 15) – Biterms generation window.
has_background (bool = False) – Use a background topic to accumulate highly frequent words.

alpha_: Model parameter.

beta_: Model parameter.

biterms_: Model biterms. Terms are coded with the corresponding ids.

coherence_: Semantic topics coherence.

coherence_window_: Number of top words for coherence calculation.

df_words_topics_: Words vs topics probabilities in a DataFrame.

fit(self, list Bs, int iterations=600, bool verbose=True)

Biterm topic model fitting method.

Parameters:

Bs (list) – List of biterms.
iterations (int = 600) – Number of iterations.
verbose (bool = True) – Show progress bar.

fit_transform(self, docs, list biterms, unicode infer_type=u'sum_b', int iterations=600, bool verbose=True)

Run model fitting and return documents vs topics matrix.

Parameters:

docs (list) – Documents list. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().
biterms (list) – List of biterms.
infer_type (str) –
Inference type. The following options are available:
1. sum_b (default).
2. sum_w.
3. mix.
iterations (int = 600) – Number of iterations.
verbose (bool = True) – Be verbose (show progress bars).

Returns:

p_zd – Documents vs topics matrix (D x T).

Return type:

np.ndarray

has_background_: Specifies whether the model has a background topic to accumulate highly frequent words.

iterations_: Number of iterations the model fitting process has gone through.

labels_: Model document labels (most probable topic for each document).

matrix_docs_topics_: Documents vs topics probabilities matrix.

matrix_topics_docs_: Topics vs documents probabilities matrix.

matrix_topics_words_: Topics vs words probabilities matrix.

matrix_words_topics_: Words vs topics probabilities matrix.

perplexity_: Perplexity. Run the transform method before calculating perplexity.
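
The dependence on transform can be seen from the standard perplexity definition, which needs the documents vs topics matrix as input. The following is a minimal sketch of that definition in pure Python, using dense lists for readability. It is an assumption that the library computes the same quantity; the name perplexity_sketch and the toy inputs are hypothetical.

```python
import math


def perplexity_sketch(p_wz, p_zd, n_dw):
    """Perplexity from topics vs words (T x W), documents vs topics (D x T)
    and documents vs words counts (D x W), all given as nested lists."""
    n_topics = len(p_wz)
    log_likelihood = 0.0
    total_count = 0
    for d, counts in enumerate(n_dw):
        for w, n in enumerate(counts):
            if n == 0:
                continue  # words absent from the document contribute nothing
            # P(w|d) = sum over topics of P(z|d) * P(w|z)
            p_w = sum(p_zd[d][t] * p_wz[t][w] for t in range(n_topics))
            log_likelihood += n * math.log(p_w)
            total_count += n
    return math.exp(-log_likelihood / total_count)


# Toy model: 2 topics that are both uniform over a 4-word vocabulary.
p_wz = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
p_zd = [[0.5, 0.5]]    # one document
n_dw = [[1, 2, 0, 1]]  # word counts of that document
value = perplexity_sketch(p_wz, p_zd, n_dw)
```

For a model that spreads probability uniformly over four words, the perplexity equals the vocabulary size, which is the expected sanity check for this formula.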

theta_: Topics probabilities vector.

topics_num_: Number of topics.

transform(self, list docs, unicode infer_type=u'sum_b', bool verbose=True)

Return documents vs topics probability matrix.

Parameters:

docs (list) – Documents list. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().
infer_type (str) –
Inference type. The following options are available:
1. sum_b (default).
2. sum_w.
3. mix.
verbose (bool = True) – Be verbose (show progress bar).

Returns:

p_zd – Documents vs topics probability matrix (D x T).

Return type:

np.ndarray
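
The inference types are not described further here. As a rough illustration only, the sketch below shows one common reading of sum_w from the BTM literature: for each word of a document, compute P(z|w) proportional to P(z) P(w|z), sum these over the words, and normalize. Whether bitermplus follows exactly this scheme is an assumption; the function name infer_sum_w and the toy values are hypothetical.

```python
def infer_sum_w(doc_ids, theta, p_wz):
    """Topic distribution for one non-empty document of word ids.

    theta: topic probabilities (length T); p_wz: topics vs words matrix (T x W).
    """
    n_topics = len(theta)
    p_zd = [0.0] * n_topics
    for w in doc_ids:
        # P(z|w) is proportional to P(z) * P(w|z); normalize over topics.
        p_zw = [theta[t] * p_wz[t][w] for t in range(n_topics)]
        norm = sum(p_zw)
        for t in range(n_topics):
            p_zd[t] += p_zw[t] / norm
    # Normalize the accumulated scores so the row sums to 1.
    total = sum(p_zd)
    return [p / total for p in p_zd]


theta = [0.5, 0.5]                # two equally likely topics
p_wz = [[0.8, 0.2], [0.2, 0.8]]   # topic 0 prefers word 0, topic 1 prefers word 1
row = infer_sum_w([0, 0, 1], theta, p_wz)
```

Stacking one such row per document would give a documents vs topics matrix of shape D x T.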

vocabulary_: Vocabulary (list of words).

vocabulary_size_: Vocabulary size (number of words).

window_: Biterms generation window size.