Metrics

bitermplus.coherence(double[:, :] p_wz, n_dw, double eps=1., int M=20)

Semantic topic coherence calculation [1].

Parameters:
  • p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).

  • n_dw (scipy.sparse.csr_matrix) – Word frequency matrix for all documents (D x W).

  • eps (float) – Smoothing constant added to each word pair conditional probability.

  • M (int) – Number of top words per topic to use in the coherence calculation.

Returns:

coherence – Semantic coherence estimates for all topics.

Return type:

np.ndarray

References

Example

>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> # Model fitting step
>>> # model = ...
>>> # Coherence calculation
>>> coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
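
For intuition, the following is a rough NumPy sketch of the document co-occurrence based semantic coherence described above, not the library's optimized Cython implementation. coherence_sketch is a hypothetical name, and details such as smoothing and word ordering may differ from what bitermplus actually does.

>>> import numpy as np
>>> def coherence_sketch(p_wz, n_dw, eps=1.0, M=20):
...     # Binary document occurrence matrix (D x W); assumes n_dw is sparse
...     occ = (n_dw > 0).astype(int)
...     # Document frequencies D(w) and co-document frequencies D(w_i, w_j)
...     df = np.asarray(occ.sum(axis=0)).ravel()
...     co_df = (occ.T @ occ).toarray()
...     scores = np.zeros(p_wz.shape[0])
...     for t in range(p_wz.shape[0]):
...         # Top-M words of topic t by probability
...         top = np.argsort(p_wz[t])[::-1][:M]
...         total = 0.0
...         for m in range(1, len(top)):
...             for l in range(m):
...                 w_m, w_l = top[m], top[l]
...                 # Conditional probability P(w_m | w_l) from document co-occurrence
...                 p_cond = co_df[w_m, w_l] / df[w_l] if df[w_l] > 0 else 0.0
...                 total += np.log(p_cond + eps)
...         scores[t] = total
...     return scores
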
bitermplus.perplexity(double[:, :] p_wz, double[:, :] p_zd, n_dw, long T) double

Perplexity calculation [1].

Parameters:
  • p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).

  • p_zd (np.ndarray) – Documents vs topics probabilities matrix (D x T).

  • n_dw (scipy.sparse.csr_matrix) – Word frequency matrix for all documents (D x W).

  • T (int) – Number of topics.

Returns:

perplexity – Perplexity estimate.

Return type:

float

References

Example

>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> # Model fitting step
>>> # model = ...
>>> # Inference step
>>> # p_zd = model.transform(docs_vec_subset)
>>> # Perplexity calculation
>>> perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
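
For reference, here is a plain NumPy sketch of the quantity this metric estimates: the exponent of the negative per-word log-likelihood of the observed documents under the fitted topic mixture. perplexity_sketch is a hypothetical name, and the small guard constant is an illustration detail rather than part of the library's implementation.

>>> import numpy as np
>>> def perplexity_sketch(p_wz, p_zd, n_dw):
...     # Per-document word distributions: P(w | d) = sum_t P(t | d) * P(w | t)
...     p_wd = p_zd @ p_wz
...     # Dense word counts (D x W)
...     n = n_dw.toarray() if hasattr(n_dw, "toarray") else np.asarray(n_dw)
...     # Log-likelihood of all observed word occurrences
...     # (a tiny constant guards against log(0))
...     ll = np.sum(n * np.log(p_wd + 1e-12))
...     # Perplexity: exponent of the negative per-word log-likelihood
...     return np.exp(-ll / n.sum())
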
bitermplus.entropy(double[:, :] p_wz, bool max_probs=True)

Renyi entropy calculation routine [1].

Renyi entropy can be used to estimate the optimal number of topics: fit several models with different numbers of topics and choose the one for which the Renyi entropy is lowest (see the sketch after the example below).

Parameters:
  • p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).

  • max_probs (bool) – Use maximum probabilities of terms per topic instead of all probability values.

Returns:

renyi – Renyi entropy value.

Return type:

double

References

Example

>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # Model fitting step
>>> # model = ...
>>> # Entropy calculation
>>> entropy = btm.entropy(model.matrix_topics_words_)
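
As a sketch of the model selection procedure described above: assuming X, vocabulary and biterms have already been prepared as in the preprocessing steps, and that the BTM constructor and fit calls follow the bitermplus README (treat their exact signatures as assumptions and adapt them to your version of the library), one could compare entropies across several topic counts.

>>> import bitermplus as btm
>>> # X, vocabulary and biterms come from the preprocessing steps above
>>> entropies = {}
>>> for T in (5, 10, 15, 20):
...     model = btm.BTM(X, vocabulary, T=T, M=20, alpha=50/T, beta=0.01)
...     model.fit(biterms, iterations=20)
...     entropies[T] = btm.entropy(model.matrix_topics_words_)
>>> # Choose the number of topics with the lowest Renyi entropy
>>> best_T = min(entropies, key=entropies.get)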