Model

class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False, double epsilon=1e-10)

Biterm Topic Model for Short Text Analysis.

This class implements the Biterm Topic Model (BTM), an algorithm designed for short texts such as tweets, reviews, and messages. Traditional topic models such as LDA rely on word co-occurrence within each document, which is sparse in short texts. BTM instead models biterms (unordered word pairs) collected from the entire corpus, which mitigates that sparsity.
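To make the notion of a biterm concrete, here is a minimal pure-Python sketch that counts unordered word pairs across a toy corpus. It is an illustration only, not the library's extraction code: the helper name count_biterms and the toy documents are assumptions made for this example, and no window limit is applied.

```python
from collections import Counter
from itertools import combinations


def count_biterms(tokenized_docs):
    """Count unordered word pairs (biterms) over the whole corpus.

    Hypothetical helper for illustration; not part of bitermplus.
    """
    counts = Counter()
    for tokens in tokenized_docs:
        # Every pair of positions in a short document forms one biterm.
        for first, second in combinations(tokens, 2):
            # Sort the pair so that word order inside a biterm is ignored.
            counts[tuple(sorted((first, second)))] += 1
    return counts


docs = [
    ["machine", "learning", "is", "great"],
    ["i", "love", "deep", "learning"],
    ["deep", "learning", "is", "great"],
]
biterm_counts = count_biterms(docs)

# Each 4-word document contributes C(4, 2) = 6 biterms.
print(sum(biterm_counts.values()))           # 18
print(biterm_counts[("deep", "learning")])   # 2
print(biterm_counts[("great", "learning")])  # 2
```

Although each document yields only six pairs, pooling them over the corpus lets recurring pairs such as ("deep", "learning") accumulate counts, which is the signal the model learns from.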

The implementation uses Cython and NumPy vectorization to process large datasets efficiently.

Parameters:
  • n_dw (scipy.sparse.csr_matrix) – Documents vs words frequency matrix. This should be the output of scikit-learn’s CountVectorizer.fit_transform() method.

  • vocabulary (array-like) – Vocabulary array containing the words/terms corresponding to the columns in n_dw matrix.

  • T (int) – Number of topics to extract from the corpus.

  • M (int, default=20) – Number of top words used for coherence calculation. This affects the semantic coherence metric computation.

  • alpha (float, default=1.0) – Dirichlet prior parameter for topic distribution. Controls the sparsity of topic assignments. Higher values create more uniform topic distributions.

  • beta (float, default=0.01) – Dirichlet prior parameter for word distribution within topics. Controls topic-word sparsity. Lower values create more focused topics.

  • seed (int, default=0) – Random state seed for reproducible results. If 0, uses current time as seed (non-reproducible). Set to a fixed integer for reproducibility.

  • win (int, default=15) – Window size for biterm generation. Biterms are extracted from words within this window distance in each document.

  • has_background (bool, default=False) – Whether to use a background topic to model highly frequent words that appear across many topics (e.g., stop words).

  • epsilon (float, default=1e-10) – Small numerical constant to prevent division by zero and improve numerical stability in probability calculations.
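As a rough illustration of how alpha and beta act as smoothing terms, the sketch below applies the point estimates given by Yan et al. (2013) to hand-made counts: theta_z = (n_z + alpha) / (|B| + T * alpha) and phi_{w|z} = (n_{w|z} + beta) / (sum_w n_{w|z} + V * beta). The counts and helper names are assumptions for this example, and the library's internal computation may differ in detail.

```python
def estimate_theta(topic_counts, alpha):
    """theta_z = (n_z + alpha) / (|B| + T * alpha)."""
    total = sum(topic_counts)
    n_topics = len(topic_counts)
    return [(n + alpha) / (total + n_topics * alpha) for n in topic_counts]


def estimate_phi(word_counts, beta):
    """phi_{w|z} = (n_{w|z} + beta) / (sum_w n_{w|z} + V * beta)."""
    total = sum(word_counts)
    vocab_size = len(word_counts)
    return [(n + beta) / (total + vocab_size * beta) for n in word_counts]


# Hypothetical counts: 9 biterms assigned to topic 0, 1 biterm to topic 1.
topic_counts = [9, 1]
# Hypothetical word counts inside one topic over a 4-word vocabulary.
word_counts = [6, 2, 0, 0]

theta_small_alpha = estimate_theta(topic_counts, alpha=0.1)
theta_large_alpha = estimate_theta(topic_counts, alpha=10.0)
phi_small_beta = estimate_phi(word_counts, beta=0.01)
phi_large_beta = estimate_phi(word_counts, beta=1.0)

# A larger alpha pulls the topic distribution toward uniform.
print([round(p, 3) for p in theta_small_alpha])  # [0.892, 0.108]
print([round(p, 3) for p in theta_large_alpha])  # [0.633, 0.367]
# A smaller beta leaves less probability mass on unobserved words.
print([round(p, 3) for p in phi_small_beta])     # [0.748, 0.25, 0.001, 0.001]
print([round(p, 3) for p in phi_large_beta])     # [0.583, 0.25, 0.083, 0.083]
```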

matrix_topics_words_

Topics × words probability matrix (T × V).

Type:

numpy.ndarray

matrix_docs_topics_

Documents × topics probability matrix (D × T).

Type:

numpy.ndarray

vocabulary_

The vocabulary used by the model.

Type:

numpy.ndarray

coherence_

Semantic coherence score for each topic.

Type:

numpy.ndarray

perplexity_

Model perplexity (lower is better).

Type:

float

theta_

Topic probability distribution.

Type:

numpy.ndarray

Examples

>>> import bitermplus as btm
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> # Prepare data
>>> texts = ["machine learning is great", "I love deep learning"]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(texts)
>>> vocabulary = vectorizer.get_feature_names_out()
>>>
>>> # Create and fit model
>>> model = btm.BTM(X, vocabulary, T=2, seed=42)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec)
>>> model.fit(biterms, iterations=100)
>>>
>>> # Get results
>>> doc_topics = model.transform(docs_vec)
>>> print("Topics per document:", doc_topics.shape)

References

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web (pp. 1445-1456).

Notes

This is a low-level interface. For easier usage, consider using the sklearn-compatible BTMClassifier class instead.

alpha_

Dirichlet prior parameter for the topic distribution (the alpha constructor argument).

Type:

float

beta_

Dirichlet prior parameter for the word distribution within topics (the beta constructor argument).

Type:

float

biterms_

Model biterms. Terms are encoded by their word IDs.

Type:

numpy.ndarray

coherence_

Semantic coherence of each topic.

Type:

numpy.ndarray
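For intuition about what a semantic coherence score measures, the following sketch implements one common definition based on document co-occurrence of a topic's top words (Mimno et al., 2011). Whether bitermplus uses exactly this formula and smoothing constant is an assumption here; the function name, the eps default, and the toy documents are illustrative only.

```python
import math


def topic_coherence(top_words, docs, eps=1.0):
    """Co-document-frequency coherence for one topic's top words.

    top_words: word IDs ordered from most to least probable.
    docs: list of sets of word IDs, one set per document.
    """
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # D(v_l): number of documents containing the higher-ranked word.
            df_l = sum(1 for doc in docs if top_words[l] in doc)
            # D(v_m, v_l): number of documents containing both words.
            df_ml = sum(
                1 for doc in docs
                if top_words[m] in doc and top_words[l] in doc
            )
            if df_l > 0:
                score += math.log((df_ml + eps) / df_l)
    return score


docs = [{0, 1, 2}, {0, 1}, {2, 3}, {0, 1, 3}]

# Words 0 and 1 always appear together; words 0 and 3 rarely do.
coherent = topic_coherence([0, 1], docs)
scattered = topic_coherence([0, 3], docs)
print(coherent > scattered)  # True
```

Higher scores indicate that a topic's top words tend to occur in the same documents. The number of top words considered plays the role of the M constructor argument.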

coherence_window_

Number of top words used for the coherence calculation.

Type:

int

df_words_topics_

Words vs topics probabilities as a DataFrame.

Type:

pandas.DataFrame

epsilon_

Numerical stability constant (epsilon) used to prevent division by zero.

Type:

float

fit(self, list Bs, int iterations=600, bool verbose=True)

Fit the Biterm Topic Model using Gibbs sampling.

This method trains the BTM model by iteratively sampling topic assignments for biterms using collapsed Gibbs sampling. The algorithm learns the topic-word and topic distributions from the biterm data.

Parameters:
  • Bs (list of list of list) – List of biterms for each document. Each document’s biterms are represented as a list of [word_id1, word_id2] pairs, typically obtained from the get_biterms() function.

  • iterations (int, default=600) – Number of Gibbs sampling iterations. More iterations generally lead to better convergence but increase computation time.

  • verbose (bool, default=True) – Whether to show a progress bar during training.

Returns:

self – Returns the fitted model instance.

Return type:

BTM

Raises:

ValueError – If no biterms are provided or all biterm lists are empty.
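The collapsed Gibbs sampling that fit performs can be sketched in pure Python as follows. This is a toy illustration of the sampling rule from Yan et al. (2013), not the library's Cython implementation: the function name gibbs_btm is hypothetical, biterms are passed as one flat list rather than per document, and the background topic option is not modeled.

```python
import random


def gibbs_btm(biterms, n_topics, vocab_size, alpha=1.0, beta=0.01,
              iterations=50, seed=42):
    """Toy collapsed Gibbs sampler over a flat list of (w1, w2) biterms."""
    rng = random.Random(seed)
    n_z = [0] * n_topics                                # biterms per topic
    n_wz = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    assignments = []

    # Assign a random initial topic to every biterm.
    for w1, w2 in biterms:
        z = rng.randrange(n_topics)
        assignments.append(z)
        n_z[z] += 1
        n_wz[z][w1] += 1
        n_wz[z][w2] += 1

    for _ in range(iterations):
        for i, (w1, w2) in enumerate(biterms):
            # Remove the biterm's current assignment from the counts.
            z = assignments[i]
            n_z[z] -= 1
            n_wz[z][w1] -= 1
            n_wz[z][w2] -= 1

            # Conditional probability of each topic, up to a constant.
            # Each biterm adds two words, so sum_w n_wz[k][w] == 2 * n_z[k].
            weights = []
            for k in range(n_topics):
                denom = 2 * n_z[k] + vocab_size * beta
                weights.append(
                    (n_z[k] + alpha)
                    * (n_wz[k][w1] + beta) * (n_wz[k][w2] + beta)
                    / (denom * denom)
                )

            # Draw a new topic and add the biterm back to the counts.
            z = rng.choices(range(n_topics), weights=weights)[0]
            assignments[i] = z
            n_z[z] += 1
            n_wz[z][w1] += 1
            n_wz[z][w2] += 1

    # Point estimates of the topic and topic-word distributions.
    total = len(biterms)
    theta = [(n + alpha) / (total + n_topics * alpha) for n in n_z]
    phi = [
        [(c + beta) / (2 * n_z[k] + vocab_size * beta) for c in n_wz[k]]
        for k in range(n_topics)
    ]
    return theta, phi


biterms = [(0, 1), (0, 1), (1, 0), (2, 3), (2, 3), (3, 2)]
theta, phi = gibbs_btm(biterms, n_topics=2, vocab_size=4)
print(round(sum(theta), 6))                  # 1.0
print([round(sum(row), 6) for row in phi])   # [1.0, 1.0]
```

Each sweep visits every biterm once, which is why the running time of fit grows with both the number of biterms and the iterations argument.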

Examples

>>> import bitermplus as btm
>>> # Assume X, vocabulary, and biterms are prepared as in the class example
>>> model = btm.BTM(X, vocabulary, T=5)
>>> model.fit(biterms, iterations=200, verbose=True)

fit_transform(self, docs, list biterms, str infer_type='sum_b', int iterations=600, bool verbose=True)

Run model fitting and return documents vs topics matrix.

Parameters:
  • docs (list) – List of documents. Each document must be represented as a list of word IDs, typically the output of bitermplus.get_vectorized_docs().

  • biterms (list) – List of biterms for each document, typically the output of bitermplus.get_biterms().

  • infer_type (str) –

    Inference type. The following options are available:

    1. sum_b (default).

    2. sum_w.

    3. mix.

  • iterations (int, default=600) – Number of Gibbs sampling iterations.

  • verbose (bool, default=True) – Whether to show progress bars.

Returns:

p_zd – Documents vs topics matrix (D × T).

Return type:

np.ndarray

has_background_

Whether the model has a background topic that accumulates highly frequent words.

Type:

bool

iterations_

Number of iterations the model fitting process has gone through.

Type:

int

labels_

Document labels (the most probable topic for each document).

Type:

numpy.ndarray
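Assuming labels_ corresponds to a row-wise argmax of the documents vs topics matrix, as its description suggests, the relationship can be sketched as follows. The helper name and the probabilities are made up for illustration.

```python
def most_probable_topics(doc_topic_rows):
    """Return the index of the largest probability in each row."""
    return [
        max(range(len(row)), key=row.__getitem__)
        for row in doc_topic_rows
    ]


# Hypothetical documents vs topics probabilities (3 documents, 3 topics).
p_zd = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]
labels = most_probable_topics(p_zd)
print(labels)  # [0, 2, 1]
```

With a NumPy array, the same result comes from calling argmax(axis=1) on the matrix.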

matrix_docs_topics_

Documents vs topics probabilities matrix.

Type:

numpy.ndarray

matrix_topics_docs_

Topics vs documents probabilities matrix.

Type:

numpy.ndarray

matrix_topics_words_

Topics vs words probabilities matrix.

Type:

numpy.ndarray

matrix_words_topics_

Words vs topics probabilities matrix.

Type:

numpy.ndarray

perplexity_

Model perplexity (lower is better). Run the transform method before reading this attribute.

Type:

float
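As a sketch of what a perplexity value summarizes, the code below uses the standard definition exp(-sum_{d,w} n_dw * log p(w|d) / sum_{d,w} n_dw), with p(w|d) = sum_z p(w|z) * p(z|d). That the library computes the attribute in exactly this way is an assumption; the function name and the toy matrices are illustrative.

```python
import math


def perplexity(phi, p_zd, doc_word_counts):
    """Perplexity of word counts under topic-word and doc-topic matrices.

    phi: topics x words probabilities.
    p_zd: documents x topics probabilities.
    doc_word_counts: documents x words counts.
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, counts in enumerate(doc_word_counts):
        for w, n_dw in enumerate(counts):
            if n_dw == 0:
                continue
            # p(w|d) mixes the topic-word probabilities by p(z|d).
            p_wd = sum(p_zd[d][z] * phi[z][w] for z in range(len(phi)))
            log_likelihood += n_dw * math.log(p_wd)
            n_tokens += n_dw
    return math.exp(-log_likelihood / n_tokens)


counts = [[1, 1, 0, 0], [0, 0, 2, 1]]
p_zd = [[1.0, 0.0], [0.0, 1.0]]

# Two sharp topics, each covering two of the four words.
sharp_phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
# Two uninformative topics that spread mass over all four words.
flat_phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

sharp = perplexity(sharp_phi, p_zd, counts)
flat = perplexity(flat_phi, p_zd, counts)
print(round(sharp, 6), round(flat, 6))  # 2.0 4.0
```

The dependence on p_zd is consistent with the requirement to run transform first, since that is the step that produces the documents vs topics matrix.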

theta_

Topics probabilities vector.

Type:

numpy.ndarray

topics_num_

Number of topics.

Type:

int

transform(self, list docs, str infer_type='sum_b', bool verbose=True)

Transform documents to topic probability distributions.

Infers topic distributions for new documents using the trained BTM model. The infer_type argument selects the strategy used to estimate the probability of each topic for each document.

Parameters:
  • docs (list of numpy.ndarray) – List of vectorized documents. Each document should be a numpy array of word IDs. Typically obtained from get_vectorized_docs() function.

  • infer_type ({'sum_b', 'sum_w', 'mix'}, default='sum_b') –

    Inference method to use:

    • 'sum_b': Sum of biterms method (default). Uses biterm probabilities to infer document topics. Best for short texts.

    • 'sum_w': Sum of words method. Uses individual word probabilities. May work better for longer documents.

    • 'mix': Mixed method. Combines topic and word distributions.

  • verbose (bool, default=True) – Whether to show a progress bar during inference.

Returns:

p_zd – Document-topic probability matrix. Each row sums to 1.0 and represents the topic distribution for the corresponding document.

Return type:

numpy.ndarray, shape (n_documents, n_topics)

Examples

>>> # Assuming model is fitted and docs_vec is prepared
>>> doc_topics = model.transform(docs_vec)
>>> print(f"Shape: {doc_topics.shape}")
>>> print(f"Topic distribution for first doc: {doc_topics[0]}")
>>> # Using different inference types
>>> topics_biterm = model.transform(docs_vec, infer_type='sum_b')
>>> topics_word = model.transform(docs_vec, infer_type='sum_w')

Notes

The model must be fitted before calling this method. Different inference types may give different results; 'sum_b' is generally preferred for short texts.
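The difference between the 'sum_b' and 'sum_w' strategies can be sketched in pure Python. The code assumes the usual BTM formulation, in which P(z|d) is the average of P(z|b) over the document's biterms for 'sum_b' and the average of P(z|w) over its words for 'sum_w'. The function names, the toy theta and phi values, and the use of all word pairs without a window are assumptions of this example; the 'mix' strategy is not covered.

```python
from itertools import combinations


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def infer_sum_b(doc, theta, phi):
    """Average P(z|b) over all biterms b of the document."""
    n_topics = len(theta)
    biterms = list(combinations(doc, 2))
    p_zd = [0.0] * n_topics
    for w1, w2 in biterms:
        # P(z|b) is proportional to theta_z * phi[z][w1] * phi[z][w2].
        p_zb = normalize(
            [theta[z] * phi[z][w1] * phi[z][w2] for z in range(n_topics)]
        )
        for z in range(n_topics):
            p_zd[z] += p_zb[z] / len(biterms)
    return p_zd


def infer_sum_w(doc, theta, phi):
    """Average P(z|w) over all words w of the document."""
    n_topics = len(theta)
    p_zd = [0.0] * n_topics
    for w in doc:
        # P(z|w) is proportional to theta_z * phi[z][w].
        p_zw = normalize([theta[z] * phi[z][w] for z in range(n_topics)])
        for z in range(n_topics):
            p_zd[z] += p_zw[z] / len(doc)
    return p_zd


# Hypothetical fitted parameters: 2 topics over a 4-word vocabulary.
theta = [0.5, 0.5]
phi = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
doc = [0, 1, 0]  # word IDs of one short document

p_sum_b = infer_sum_b(doc, theta, phi)
p_sum_w = infer_sum_w(doc, theta, phi)
print([round(p, 4) for p in p_sum_b])  # [0.9878, 0.0122]
print([round(p, 4) for p in p_sum_w])  # [0.9, 0.1]
```

In this toy case the biterm-based estimate is sharper because each pair multiplies two word probabilities, so co-occurring words reinforce the same topic.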

vocabulary_

Vocabulary (list of words).

Type:

numpy.ndarray

vocabulary_size_

Vocabulary size (number of words).

Type:

int

window_

Biterms generation window size.

Type:

int