Model
- class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False, double epsilon=1e-10)
Biterm Topic Model for Short Text Analysis.
This class implements the Biterm Topic Model (BTM) algorithm, specifically designed for short text analysis such as tweets, reviews, and messages. Unlike traditional topic models like LDA, BTM extracts biterms (word pairs) from the entire corpus to overcome data sparsity issues in short texts.
The implementation is highly optimized with Cython and NumPy vectorization for efficient processing of large datasets.
- Parameters:
n_dw (scipy.sparse.csr_matrix) – Documents vs words frequency matrix. This should be the output of scikit-learn’s CountVectorizer.fit_transform() method.
vocabulary (array-like) – Vocabulary array containing the words/terms corresponding to the columns in n_dw matrix.
T (int) – Number of topics to extract from the corpus.
M (int, default=20) – Number of top words used for coherence calculation. This affects the semantic coherence metric computation.
alpha (float, default=1.0) – Dirichlet prior parameter for topic distribution. Controls the sparsity of topic assignments. Higher values create more uniform topic distributions.
beta (float, default=0.01) – Dirichlet prior parameter for word distribution within topics. Controls topic-word sparsity. Lower values create more focused topics.
seed (int, default=0) – Random state seed for reproducible results. If 0, uses current time as seed (non-reproducible). Set to a fixed integer for reproducibility.
win (int, default=15) – Window size for biterm generation. Biterms are extracted from words within this window distance in each document.
has_background (bool, default=False) – Whether to use a background topic to model highly frequent words that appear across many topics (e.g., stop words).
epsilon (float, default=1e-10) – Small numerical constant to prevent division by zero and improve numerical stability in probability calculations.
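The effect of the win parameter can be illustrated with a small pure-Python sketch. It is not the library's code: the helper name extract_biterms, the boundary convention (positions fewer than win apart), and the sorted pair order are assumptions made for illustration.

```python
from itertools import combinations


def extract_biterms(doc, win=15):
    """Return unordered word-id pairs whose positions are fewer than `win` apart.

    Hypothetical helper for illustration; the exact window boundary is assumed.
    """
    biterms = []
    for i, j in combinations(range(len(doc)), 2):
        if j - i < win:
            # Store each pair in sorted order so (a, b) and (b, a) coincide.
            biterms.append((min(doc[i], doc[j]), max(doc[i], doc[j])))
    return biterms


doc = [3, 0, 2, 1]  # hypothetical word ids for one short document
wide = extract_biterms(doc, win=15)   # every pair of positions qualifies
narrow = extract_biterms(doc, win=2)  # only adjacent positions qualify
print(len(wide), len(narrow))
```

For short texts the default window usually covers the whole document, so every pair of words forms a biterm; a smaller window only matters for longer inputs.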
- matrix_topics_words_
Topics × words probability matrix (T × V).
- Type:
numpy.ndarray
- matrix_docs_topics_
Documents × topics probability matrix (D × T).
- Type:
numpy.ndarray
- vocabulary_
The vocabulary used by the model.
- Type:
numpy.ndarray
- coherence_
Semantic coherence score for each topic.
- Type:
numpy.ndarray
- perplexity_
Model perplexity (lower is better).
- Type:
float
- theta_
Topic probability distribution.
- Type:
numpy.ndarray
Examples
>>> import bitermplus as btm
>>> import pandas as pd
>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> # Prepare data
>>> texts = ["machine learning is great", "I love deep learning"]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(texts)
>>> vocabulary = vectorizer.get_feature_names_out()
>>>
>>> # Create and fit model
>>> model = btm.BTM(X, vocabulary, T=2, seed=42)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec)
>>> model.fit(biterms, iterations=100)
>>>
>>> # Get results
>>> doc_topics = model.transform(docs_vec)
>>> print("Topics per document:", doc_topics.shape)
References
Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web (pp. 1445-1456).
Notes
This is a low-level interface. For easier usage, consider using the sklearn-compatible BTMClassifier class instead.
- alpha_
Model parameter alpha (Dirichlet prior for the topic distribution).
- Type:
float
- biterms_
Model biterms. Terms are coded with the corresponding ids.
- Type:
np.ndarray
- coherence_
Semantic coherence of the topics.
- Type:
np.ndarray
- coherence_window_
Number of top words for coherence calculation.
- Type:
int
- df_words_topics_
Words vs topics probabilities in a DataFrame.
- Type:
DataFrame
- epsilon_
Numerical stability constant (epsilon) used to prevent division by zero.
- Type:
float
- fit(self, list Bs, int iterations=600, bool verbose=True)
Fit the Biterm Topic Model using Gibbs sampling.
This method trains the BTM model by iteratively sampling topic assignments for biterms using collapsed Gibbs sampling. The algorithm learns the topic-word and topic distributions from the biterm data.
- Parameters:
Bs (list of list of list) – List of biterms for each document. Each document’s biterms are represented as a list of [word_id1, word_id2] pairs. Obtained from get_biterms() function.
iterations (int, default=600) – Number of Gibbs sampling iterations. More iterations generally lead to better convergence but increase computation time.
verbose (bool, default=True) – Whether to show a progress bar during training.
- Returns:
self – Returns the fitted model instance.
- Return type:
BTM
- Raises:
ValueError – If no biterms are provided or all biterm lists are empty.
Examples
>>> import bitermplus as btm
>>> # Assume X, vocabulary, and biterms are already prepared
>>> model = btm.BTM(X, vocabulary, T=5)
>>> model.fit(biterms, iterations=200, verbose=True)
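To make the sampling step concrete, here is a toy collapsed Gibbs sampler in pure Python. It follows the conditional distribution given in Yan et al. (2013), not the library's Cython code; the function name gibbs_btm, the flattening of Bs into one list, and the tiny input are all assumptions for illustration.

```python
import random


def gibbs_btm(Bs, V, T, alpha=1.0, beta=0.01, iterations=50, seed=42):
    """Toy collapsed Gibbs sampler for BTM (illustrative, not the library code)."""
    biterms = [tuple(b) for doc in Bs for b in doc]  # flatten per-document lists
    if not biterms:
        raise ValueError("no biterms provided")
    rng = random.Random(seed)
    n_z = [0] * T                        # number of biterms assigned to each topic
    n_wz = [[0] * V for _ in range(T)]   # word counts per topic
    z = []
    for w1, w2 in biterms:               # random initial topic assignment
        k = rng.randrange(T)
        z.append(k)
        n_z[k] += 1
        n_wz[k][w1] += 1
        n_wz[k][w2] += 1

    for _ in range(iterations):
        for b, (w1, w2) in enumerate(biterms):
            k = z[b]                     # remove the current assignment
            n_z[k] -= 1
            n_wz[k][w1] -= 1
            n_wz[k][w2] -= 1
            # Unnormalized conditional probability of each topic for this biterm.
            weights = []
            for t in range(T):
                denom = 2 * n_z[t] + V * beta
                weights.append((n_z[t] + alpha)
                               * (n_wz[t][w1] + beta) * (n_wz[t][w2] + beta)
                               / (denom * denom))
            k = rng.choices(range(T), weights=weights)[0]
            z[b] = k                     # record the newly sampled topic
            n_z[k] += 1
            n_wz[k][w1] += 1
            n_wz[k][w2] += 1

    total = len(biterms)
    theta = [(n_z[t] + alpha) / (total + T * alpha) for t in range(T)]
    phi = [[(n_wz[t][w] + beta) / (2 * n_z[t] + V * beta) for w in range(V)]
           for t in range(T)]
    return theta, phi


# Hypothetical input: two documents, four distinct word ids.
Bs = [[[0, 1], [0, 2], [1, 2]], [[2, 3]]]
theta, phi = gibbs_btm(Bs, V=4, T=2)
print(round(sum(theta), 6), [round(sum(row), 6) for row in phi])
```

Here theta plays the role of the topic distribution and phi the topic-word matrix; both are normalized by construction, which is why a single global topic distribution is estimated rather than one per document.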
- fit_transform(self, docs, list biterms, str infer_type='sum_b', int iterations=600, bool verbose=True)
Run model fitting and return documents vs topics matrix.
- Parameters:
docs (list) – Documents list. Each document must be presented as a list of word ids. Typically, it is the output of bitermplus.get_vectorized_docs().
biterms (list) – List of biterms.
infer_type (str, default='sum_b') –
Inference type. The following options are available:
’sum_b’ (default).
’sum_w’.
’mix’.
iterations (int, default=600) – Number of iterations.
verbose (bool, default=True) – Be verbose (show progress bars).
- Returns:
p_zd – Documents vs topics matrix (D x T).
- Return type:
np.ndarray
- has_background_
Specifies whether the model has a background topic to accumulate highly frequent words.
- Type:
bool
- iterations_
Number of iterations the model fitting process has gone through.
- Type:
int
- labels_
Model document labels (most probable topic for each document).
- Type:
np.ndarray
- matrix_docs_topics_
Documents vs topics probabilities matrix.
- Type:
np.ndarray
- matrix_topics_docs_
Topics vs documents probabilities matrix.
- Type:
np.ndarray
- matrix_topics_words_
Topics vs words probabilities matrix.
- Type:
np.ndarray
- matrix_words_topics_
Words vs topics probabilities matrix.
- Type:
np.ndarray
- perplexity_
Perplexity. Run the transform method before calculating perplexity.
- Type:
float
- theta_
Topics probabilities vector.
- Type:
np.ndarray
- topics_num_
Number of topics.
- Type:
int
- transform(self, list docs, str infer_type='sum_b', bool verbose=True)
Transform documents to topic probability distributions.
Infers topic distributions for new documents using the trained BTM model. This method uses different inference strategies to estimate the probability of each topic for each document.
- Parameters:
docs (list of numpy.ndarray) – List of vectorized documents. Each document should be a numpy array of word IDs. Typically obtained from get_vectorized_docs() function.
infer_type ({'sum_b', 'sum_w', 'mix'}, default='sum_b') –
Inference method to use:
’sum_b’: Sum of biterms method (default). Uses biterm probabilities to infer document topics. Best for short texts.
’sum_w’: Sum of words method. Uses individual word probabilities. May work better for longer documents.
’mix’: Mixed method. Combines topic and word distributions.
verbose (bool, default=True) – Whether to show a progress bar during inference.
- Returns:
p_zd – Document-topic probability matrix. Each row sums to 1.0 and represents the topic distribution for the corresponding document.
- Return type:
numpy.ndarray, shape (n_documents, n_topics)
Examples
>>> # Assuming model is fitted and docs_vec is prepared
>>> doc_topics = model.transform(docs_vec)
>>> print(f"Shape: {doc_topics.shape}")
>>> print(f"Topic distribution for first doc: {doc_topics[0]}")
>>>
>>> # Using different inference types
>>> topics_biterm = model.transform(docs_vec, infer_type='sum_b')
>>> topics_word = model.transform(docs_vec, infer_type='sum_w')
Notes
The model must be fitted before calling this method. Different inference types may give different results, with ‘sum_b’ generally preferred for short texts.
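The idea behind the ’sum_b’ strategy can be sketched in pure Python, following the inference rule in Yan et al. (2013): each biterm's topic posterior is averaged over the biterms of a document. The function name infer_sum_b, the uniform fallback for documents without biterms, and the toy parameter values are assumptions, not the library's behavior.

```python
def infer_sum_b(doc_biterms, theta, phi):
    """Average the per-biterm topic posteriors of one document (illustrative)."""
    T = len(theta)
    if not doc_biterms:
        # Assumption: fall back to a uniform distribution for empty documents.
        return [1.0 / T] * T
    p_zd = [0.0] * T
    for w1, w2 in doc_biterms:
        # P(z | b) is proportional to theta_z * phi_{w1|z} * phi_{w2|z}.
        joint = [theta[t] * phi[t][w1] * phi[t][w2] for t in range(T)]
        norm = sum(joint)
        for t in range(T):
            p_zd[t] += joint[t] / norm
    total = sum(p_zd)
    return [p / total for p in p_zd]


theta = [0.5, 0.5]                         # hypothetical topic distribution
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # hypothetical 2 topics x 3 words
p = infer_sum_b([(0, 1), (0, 0)], theta, phi)
label = max(range(len(p)), key=p.__getitem__)  # most probable topic
print(p, label)
```

Taking the index of the largest entry in each row, as in the last step, corresponds to the idea of a per-document label as the most probable topic.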
- vocabulary_
Vocabulary (list of words).
- Type:
np.ndarray
- vocabulary_size_
Vocabulary size (number of words).
- Type:
int
- window_
Biterms generation window size.
- Type:
int