Metrics
- bitermplus.coherence(double[:, :] p_wz, n_dw, double eps=1., int M=20)
Semantic topic coherence calculation [1].
- Parameters:
p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
n_dw (scipy.sparse.csr_matrix) – Words frequency matrix for all documents (D x W).
eps (float) – Smoothing parameter that is added to the word-pair conditional probability.
M (int) – Number of top words in a topic to take.
- Returns:
coherence – Semantic coherence estimates for all topics.
- Return type:
np.ndarray
References
Example
>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> # Model fitting step
>>> # model = ...
>>> # Coherence calculation
>>> coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
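To make the role of `eps` and `M` concrete, here is a minimal NumPy sketch of one common document co-occurrence coherence score (often called UMass coherence). It is an assumption that this matches the library's computation in every detail. The function name `coherence_sketch` is hypothetical, and a dense array stands in for the sparse `n_dw` matrix.

```python
import numpy as np

def coherence_sketch(p_wz, n_dw, eps=1.0, M=20):
    """Illustrative co-occurrence coherence; not the bitermplus implementation.

    p_wz: (T x W) topic-word probabilities.
    n_dw: (D x W) dense word counts per document (assumption: dense input).
    """
    p_wz = np.asarray(p_wz, dtype=float)
    # 1.0 where a word occurs in a document at least once, else 0.0
    present = (np.asarray(n_dw) > 0).astype(float)
    scores = np.zeros(p_wz.shape[0])
    for t in range(p_wz.shape[0]):
        # Indices of the M most probable words in topic t, most probable first
        top = np.argsort(p_wz[t])[::-1][:M]
        for m in range(1, len(top)):
            for l in range(m):
                # Documents containing the higher-ranked word
                # (assumed non-zero, i.e. every top word occurs somewhere)
                d_l = present[:, top[l]].sum()
                # Documents containing both words
                d_ml = (present[:, top[m]] * present[:, top[l]]).sum()
                scores[t] += np.log((d_ml + eps) / d_l)
    return scores

# Toy input: 3 documents, 3 words, 2 topics
n_dw = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 0]])
p_wz = np.array([[0.6, 0.3, 0.1], [0.1, 0.6, 0.3]])
scores = coherence_sketch(p_wz, n_dw, eps=1.0, M=2)
```

Under this formulation, values closer to zero indicate that a topic's top words tend to appear in the same documents.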
- bitermplus.perplexity(double[:, :] p_wz, double[:, :] p_zd, n_dw, long T) → double
Perplexity calculation [1].
- Parameters:
p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
p_zd (np.ndarray) – Documents vs topics probabilities matrix (D x T).
n_dw (scipy.sparse.csr_matrix) – Words frequency matrix for all documents (D x W).
T (int) – Number of topics.
- Returns:
perplexity – Perplexity estimate.
- Return type:
float
References
[1] Heinrich, G. (2005). Parameter estimation for text analysis (pp. 1-32). Technical report.
Example
>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> # Model fitting step
>>> # model = ...
>>> # Inference step
>>> # p_zd = model.transform(docs_vec_subset)
>>> # Perplexity calculation
>>> perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
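As a rough guide to what the estimate measures, the sketch below computes the standard held-out perplexity, exp(-log-likelihood / total word count), from the same three matrices. It is a dense NumPy illustration under the assumption that p(w|d) is the product of `p_zd` and `p_wz`. The name `perplexity_sketch` is hypothetical, and this is not the library's Cython routine (which also takes `T` and a sparse `n_dw`).

```python
import numpy as np

def perplexity_sketch(p_wz, p_zd, n_dw):
    """Illustrative perplexity from dense inputs; not the bitermplus routine."""
    p_wz = np.asarray(p_wz, dtype=float)  # (T x W) topic-word probabilities
    p_zd = np.asarray(p_zd, dtype=float)  # (D x T) document-topic probabilities
    n_dw = np.asarray(n_dw, dtype=float)  # (D x W) word counts per document
    # Word probabilities per document: p(w|d) = sum_z p(z|d) * p(w|z)
    p_wd = p_zd @ p_wz
    # Only observed words contribute to the log-likelihood
    mask = n_dw > 0
    log_lik = np.sum(n_dw[mask] * np.log(p_wd[mask]))
    return float(np.exp(-log_lik / n_dw.sum()))

# One topic that is uniform over four words
p_wz = np.full((1, 4), 0.25)
p_zd = np.ones((2, 1))
n_dw = np.array([[2, 0, 1, 0], [0, 3, 0, 1]])
value = perplexity_sketch(p_wz, p_zd, n_dw)
```

Lower values mean the model assigns higher probability to the observed words. A model that is uniform over W words scores exactly W.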
- bitermplus.entropy(double[:, :] p_wz, bool max_probs=True)
Renyi entropy calculation routine [1].
Renyi entropy can be used to estimate the optimal number of topics: fit several models with different numbers of topics and choose the number of topics for which the Renyi entropy is lowest.
- Parameters:
p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
max_probs (bool) – Use maximum probabilities of terms per topic instead of all probability values.
- Returns:
renyi (double) – Renyi entropy value.
References
[1] Koltcov, S. (2018). Application of Rényi and Tsallis entropies to topic modeling optimization. Physica A: Statistical Mechanics and its Applications, 512, 1192-1204.
Example
>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # Model fitting step
>>> # model = ...
>>> # Entropy calculation
>>> entropy = btm.entropy(model.matrix_topics_words_)
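The model selection procedure described above (fit several models, keep the topic count with the lowest Renyi entropy) can be sketched as follows. The helper `select_num_topics` is hypothetical and the entropy values are placeholders for illustration only. In practice each value would come from calling `btm.entropy(model.matrix_topics_words_)` on a model fitted with that number of topics.

```python
def select_num_topics(entropy_by_topics):
    """Return the number of topics whose Renyi entropy is lowest.

    entropy_by_topics: dict mapping a topic count to its entropy value.
    """
    if not entropy_by_topics:
        raise ValueError("at least one (topics, entropy) pair is required")
    # min() over the keys, compared by their associated entropy values
    return min(entropy_by_topics, key=entropy_by_topics.get)

# Placeholder values, NOT real results: one entry per fitted model
entropy_by_topics = {5: 3.2, 10: 2.7, 15: 2.9}
best_T = select_num_topics(entropy_by_topics)
```

Keeping the values in a dictionary keyed by topic count makes it straightforward to plot entropy against the number of topics before settling on a choice.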