Model

class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False)

Biterm Topic Model.

Parameters:

n_dw (scipy.sparse.csr_matrix) – Documents vs words frequency matrix. Typically, this is the output of CountVectorizer from the sklearn package.
vocabulary (list) – Vocabulary (a list of words).
T (int) – Number of topics.
M (int = 20) – Number of top words for coherence calculation.
alpha (float = 1) – Dirichlet prior hyperparameter for the topic distribution.
beta (float = 0.01) – Dirichlet prior hyperparameter for the topic-word distributions.
seed (int = 0) – Random state seed. If seed is 0 (default), time(NULL) is used as the seed.
win (int = 15) – Biterms generation window.
has_background (bool = False) – Use a background topic to accumulate highly frequent words.

alpha_

Model parameter alpha (Dirichlet prior hyperparameter for the topic distribution).

Type: float

beta_

Model parameter beta (Dirichlet prior hyperparameter for the topic-word distributions).

Type: float

biterms_

Model biterms. Terms are coded with the corresponding ids.

Type: np.ndarray
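To make the idea of id-coded biterms and the win parameter concrete, here is a minimal pure-Python sketch. It assumes that a biterm is an unordered pair of word ids whose positions in a document are fewer than win apart. The helper name sketch_biterms is hypothetical, and the exact window boundary used by the library may differ.

```python
def sketch_biterms(doc_ids, win=15):
    """Return unordered word-id pairs whose positions are fewer than `win` apart.

    Illustrative only: this is not the library's implementation.
    """
    biterms = []
    for i in range(len(doc_ids) - 1):
        # Pair position i with the following positions inside the window.
        for j in range(i + 1, min(i + win, len(doc_ids))):
            a, b = doc_ids[i], doc_ids[j]
            # Store each pair in a canonical (smaller id, larger id) order.
            biterms.append((min(a, b), max(a, b)))
    return biterms


doc = [3, 0, 2, 1]  # one short document given as word ids
pairs_wide = sketch_biterms(doc, win=15)   # window covers the whole document
pairs_narrow = sketch_biterms(doc, win=2)  # only adjacent words are paired
```

With a window larger than the document, every pair of positions produces a biterm. A window of 2 keeps only neighbouring words.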

coherence_

Semantic topics coherence.

Type: np.ndarray

coherence_window_

Number of top words for coherence calculation.

Type: int

df_words_topics_

Words vs topics probabilities as a DataFrame.

Type: DataFrame

fit(self, list Bs, int iterations=600, bool verbose=True)

Biterm topic model fitting method.

Parameters:

Bs (list) – List of biterms.
iterations (int = 600) – Number of iterations.
verbose (bool = True) – Show progress bar.

fit_transform(self, docs, list biterms, unicode infer_type=u'sum_b', int iterations=600, bool verbose=True)

Run model fitting and return documents vs topics matrix.

Parameters:

docs (list) – List of documents. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().
biterms (list) – List of biterms.
infer_type (str) – Inference type. The following options are available:
1. sum_b (default).
2. sum_w.
3. mix.
iterations (int = 600) – Number of iterations.
verbose (bool = True) – Be verbose (show progress bars).

Returns:

p_zd – Documents vs topics matrix (D x T).

Return type:

np.ndarray

has_background_

Specifies whether the model has a background topic to accumulate highly frequent words.

Type: bool

iterations_

Number of iterations the model fitting process has gone through.

Type: int

labels_

Model document labels (most probable topic for each document).

Type: np.ndarray
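The phrase "most probable topic for each document" amounts to a row-wise argmax over a documents vs topics matrix. The sketch below uses a small hand-written matrix (hypothetical values, plain Python lists) rather than a fitted model. With a NumPy array the same result would come from p_zd.argmax(axis=1).

```python
# Hypothetical documents vs topics probabilities (3 documents, 3 topics).
p_zd = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]

# For each document (row), pick the index of the largest probability.
labels = [max(range(len(row)), key=row.__getitem__) for row in p_zd]
```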

matrix_docs_topics_

Documents vs topics probabilities matrix.

Type: np.ndarray

matrix_topics_docs_

Topics vs documents probabilities matrix.

Type: np.ndarray

matrix_topics_words_

Topics vs words probabilities matrix.

Type: np.ndarray

matrix_words_topics_

Words vs topics probabilities matrix.

Type: np.ndarray

perplexity_

Perplexity.

Run the transform method before calculating perplexity.

Type: float
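For orientation, the sketch below computes perplexity under the usual topic-model definition: the exponential of the negative average per-word log-likelihood, with p(w|d) obtained by summing p(w|z) p(z|d) over topics. That this matches the library's exact computation is an assumption. The function name sketch_perplexity and the dense list inputs are hypothetical.

```python
import math


def sketch_perplexity(p_wz, p_zd, n_dw):
    """Perplexity from topic-word probs, doc-topic probs, and word counts.

    p_wz: T x W topics vs words probabilities.
    p_zd: D x T documents vs topics probabilities.
    n_dw: D x W documents vs words counts (dense lists here for simplicity).
    """
    n_topics = len(p_wz)
    log_lik = 0.0
    n_words = 0
    for d, counts in enumerate(n_dw):
        for w, n in enumerate(counts):
            if n == 0:
                continue  # words absent from the document contribute nothing
            # p(w|d) = sum over topics z of p(z|d) * p(w|z)
            p_wd = sum(p_zd[d][z] * p_wz[z][w] for z in range(n_topics))
            log_lik += n * math.log(p_wd)
            n_words += n
    return math.exp(-log_lik / n_words)


# Sanity check: a uniform model over 4 words has perplexity 4.
uniform_p_wz = [[0.25] * 4, [0.25] * 4]
uniform_p_zd = [[0.5, 0.5], [0.5, 0.5]]
counts = [[1, 2, 0, 1], [0, 3, 1, 0]]
ppl = sketch_perplexity(uniform_p_wz, uniform_p_zd, counts)
```

The documents vs topics matrix is an input here, which is consistent with the note that transform must be run first.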

theta_

Topics probabilities vector.

Type: np.ndarray

topics_num_

Number of topics.

Type: int

transform(self, list docs, unicode infer_type=u'sum_b', bool verbose=True)

Return documents vs topics probability matrix.

Parameters:

docs (list) – List of documents. Each document must be represented as a list of word ids. Typically, it can be the output of bitermplus.get_vectorized_docs().
infer_type (str) – Inference type. The following options are available:
1. sum_b (default).
2. sum_w.
3. mix.
verbose (bool = True) – Be verbose (show progress bar).

Returns:

p_zd – Documents vs topics probability matrix (D x T).

Return type:

np.ndarray
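A typical end-to-end use of the constructor, fit, and transform might look like the sketch below. The wrapper name sketch_btm_pipeline is hypothetical. get_words_freqs and get_biterms are helper functions from the same package that this section does not document, so treat their exact signatures as assumptions. The hyperparameter values are illustrative rather than recommended.

```python
def sketch_btm_pipeline(texts, n_topics=8, iterations=20, seed=12321):
    """Fit a BTM on raw texts and return (model, p_zd). Illustrative sketch."""
    # Imported lazily so that defining this helper does not require the package.
    import bitermplus as btm

    # Documents vs words counts, the vocabulary, and a word-to-id mapping
    # (helper assumed from the package, not documented in this section).
    n_dw, vocabulary, _vocab_dict = btm.get_words_freqs(texts)
    # Each document as a list of word ids.
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Biterms built from the vectorized documents (assumed helper).
    biterms = btm.get_biterms(docs_vec)

    model = btm.BTM(
        n_dw, vocabulary, T=n_topics, M=20,
        alpha=50 / n_topics, beta=0.01, seed=seed,
    )
    model.fit(biterms, iterations=iterations)
    # Documents vs topics probability matrix (D x T).
    p_zd = model.transform(docs_vec)
    return model, p_zd
```

Calling sketch_btm_pipeline(list_of_strings) would return the fitted model together with p_zd. After that call, attributes such as labels_ and perplexity_ can be read from the model.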

vocabulary_

Vocabulary (list of words).

Type: np.ndarray

vocabulary_size_

Vocabulary size (number of words).

Type: int

window_

Biterms generation window size.

Type: int