Metrics

bitermplus.coherence(double[:, :] p_wz, n_dw, double eps=1., int M=20)

Semantic topic coherence calculation [1]_.

Parameters:

p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
n_dw (scipy.sparse.csr_matrix) – Word frequency matrix for all documents (D x W).
eps (float) – Smoothing parameter. It is added to the word pair conditional probability. Defaults to 1.0.
M (int) – Number of top words to take from each topic. Defaults to 20.

Returns:

coherence – Semantic coherence estimates for all topics.

Return type:

np.ndarray

References

Example

>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> # Model fitting step
>>> # model = ...
>>> # Coherence calculation
>>> coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
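
To make the roles of eps and M more concrete, here is a minimal NumPy sketch of one common formulation of semantic coherence: for each topic, the top M words are taken and log((D(w_m, w_l) + eps) / D(w_l)) is summed over word pairs, where D counts documents containing the word(s). This is an illustration under stated assumptions, not necessarily what bitermplus computes internally. The function name coherence_sketch and the toy matrices are hypothetical, and the sketch expects a dense array (a scipy.sparse matrix would first need .toarray()).

```python
import numpy as np

def coherence_sketch(p_wz, n_dw, eps=1.0, M=20):
    """Illustrative coherence per topic (hypothetical helper, not the library code)."""
    p_wz = np.asarray(p_wz, dtype=float)   # T x W topic-word probabilities
    present = np.asarray(n_dw) > 0         # D x W: does document d contain word w?
    doc_freq = present.sum(axis=0)         # D(w): number of documents containing w
    scores = np.zeros(p_wz.shape[0])
    for t in range(p_wz.shape[0]):
        # Indices of the M most probable words in topic t, most probable first.
        top = np.argsort(p_wz[t])[::-1][:M]
        for m in range(1, len(top)):
            for l in range(m):
                d_l = doc_freq[top[l]]
                if d_l == 0:
                    continue  # skip words that never occur in the corpus
                # Number of documents containing both words of the pair.
                d_ml = np.sum(present[:, top[m]] & present[:, top[l]])
                scores[t] += np.log((d_ml + eps) / d_l)
    return scores

# Toy data (made up for illustration): 2 topics, 3 words, 5 documents.
p_wz = np.array([[0.6, 0.3, 0.1],
                 [0.1, 0.3, 0.6]])
n_dw = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1],
                 [0, 0, 1]])
scores = coherence_sketch(p_wz, n_dw, eps=1.0, M=2)
```

In this toy corpus the top two words of the first topic always co-occur, so that topic receives the higher score. Larger values (closer to zero or positive) indicate more coherent topics under this formulation.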

bitermplus.perplexity(double[:, :] p_wz, double[:, :] p_zd, n_dw, long T) → double

Perplexity calculation [1]_.

Parameters:

p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
p_zd (np.ndarray) – Documents vs topics probabilities matrix (D x T).
n_dw (scipy.sparse.csr_matrix) – Word frequency matrix for all documents (D x W).
T (int) – Number of topics.

Returns:

perplexity – Perplexity estimate.

Return type:

float

References

Example

>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> # Model fitting step
>>> # model = ...
>>> # Inference step
>>> # p_zd = model.transform(docs_vec_subset)
>>> # Perplexity calculation
>>> perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
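
For intuition about how the three matrices combine, the following sketch implements the textbook definition of perplexity: the exponential of the negative average log-likelihood per word token, with p(w|d) obtained by mixing topic-word probabilities with document-topic probabilities. That this matches the library's computation is an assumption. The name perplexity_sketch is hypothetical, the number of topics is inferred from the matrix shapes rather than passed as T, and dense arrays are expected.

```python
import numpy as np

def perplexity_sketch(p_wz, p_zd, n_dw):
    """Textbook perplexity (hypothetical helper, not the library code)."""
    p_wz = np.asarray(p_wz, dtype=float)   # T x W topic-word probabilities
    p_zd = np.asarray(p_zd, dtype=float)   # D x T document-topic probabilities
    n_dw = np.asarray(n_dw, dtype=float)   # D x W word counts
    # p(w|d) = sum_z p(z|d) * p(w|z), giving a D x W matrix.
    p_wd = p_zd @ p_wz
    # Only observed words contribute; masking also avoids log(0) for unseen words.
    mask = n_dw > 0
    log_lik = np.sum(n_dw[mask] * np.log(p_wd[mask]))
    return float(np.exp(-log_lik / n_dw.sum()))

# Toy check (made-up data): a completely uniform model over 4 words.
p_wz = np.full((2, 4), 0.25)
p_zd = np.full((3, 2), 0.5)
n_dw = np.array([[2, 0, 1, 0],
                 [0, 3, 0, 1],
                 [1, 1, 1, 1]])
value = perplexity_sketch(p_wz, p_zd, n_dw)
```

A uniform model over W words is as uncertain as a fair W-sided die, so its perplexity under this definition equals the vocabulary size. Lower perplexity means the model predicts the observed words better.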

bitermplus.entropy(double[:, :] p_wz, bool max_probs=True)

Renyi entropy calculation routine [1]_.

Renyi entropy can be used to estimate the optimal number of topics: fit several models with different numbers of topics and choose the number for which the Renyi entropy is lowest.

Parameters:

p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
max_probs (bool) – Use maximum probabilities of terms per topics instead of all probability values.

Returns:

renyi (double) – Renyi entropy value.

References

Example

>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # Model fitting step
>>> # model = ...
>>> # Entropy calculation
>>> entropy = btm.entropy(model.matrix_topics_words_)
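
The model selection procedure described above (fit several models, keep the number of topics with the lowest Renyi entropy) can be sketched as a small helper. The names pick_num_topics and toy_entropy are hypothetical. The toy scorer only stands in for the real work so that the snippet runs on its own; in practice the scoring callable would fit a model with T topics and return btm.entropy(model.matrix_topics_words_).

```python
def pick_num_topics(candidates, entropy_for):
    """Return (best_T, scores), where best_T minimizes entropy_for(T).

    entropy_for is any callable mapping a number of topics to an entropy value.
    """
    scores = {T: float(entropy_for(T)) for T in candidates}
    best_T = min(scores, key=scores.get)
    return best_T, scores

def toy_entropy(T):
    # Stand-in scorer with made-up values, for demonstration only.
    # Replace with a function that fits a model and calls btm.entropy.
    return (T - 8) ** 2 / 10.0 + 1.0

best_T, scores = pick_num_topics(range(4, 14, 2), toy_entropy)
```

Keeping the full scores dictionary, rather than only the minimum, makes it easy to plot entropy against the number of topics and check whether the minimum is sharp or shallow.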