Sklearn-style API

The bitermplus package now includes a sklearn-compatible API. The BTMClassifier class exposes the familiar fit/transform estimator interface, so it can be used in scikit-learn pipelines, cross-validation, and grid search.

Quick Start 

The new API reduces complex topic modeling workflows to just a few lines:

import bitermplus as btm

# Sample documents
texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text",
    "artificial intelligence transforms industries"
]

# Create and fit the model (no manual preprocessing needed)
model = btm.BTMClassifier(n_topics=2, random_state=42)
model.fit(texts)

# Get topic distributions
doc_topics = model.transform(texts)
print(f"Document-topic matrix shape: {doc_topics.shape}")

# Interpret topics
topic_words = model.get_topic_words(n_words=5)
for topic_id, words in topic_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

API Comparison 

Traditional API (complex, multi-step):

# Multiple manual preprocessing steps
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Model creation and fitting
model = btm.BTM(X, vocabulary, seed=42, T=3, M=20, alpha=50/3, beta=0.01)
model.fit(biterms, iterations=100)

# Inference
p_zd = model.transform(docs_vec)

New Sklearn API (simple, two lines):

# Everything in one step!
model = btm.BTMClassifier(n_topics=3, random_state=42)
doc_topics = model.fit_transform(texts)

BTMClassifier Class 

class bitermplus.BTMClassifier(n_topics: int = 8, alpha: float | None = None, beta: float = 0.01, max_iter: int = 600, random_state: int | None = None, window_size: int = 15, has_background: bool = False, coherence_window: int = 20, vectorizer_params: Dict[str, Any] | None = None, epsilon: float = 1e-10)

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible Biterm Topic Model for short text analysis.

This class provides a scikit-learn compatible interface for the Biterm Topic Model, designed specifically for short text analysis such as tweets, reviews, and messages. Unlike traditional topic models like LDA, BTM extracts biterms (word pairs) from the entire corpus to overcome data sparsity issues in short texts.

The BTMClassifier automatically handles text preprocessing, vectorization, biterm generation, model training, and inference, making topic modeling as simple as calling fit() and transform().
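To make the notion of a biterm concrete, here is a minimal pure-Python sketch of extracting word pairs from one tokenized document. It is an illustration only, not the library's implementation: the helper name extract_biterms is hypothetical, and it assumes that two words form a biterm when their positions differ by less than window_size.

```python
def extract_biterms(tokens, window_size=15):
    """Return unordered word pairs whose positions differ by less than window_size."""
    biterms = []
    for i, first in enumerate(tokens):
        # Pair the word at position i with the words that follow it inside the window.
        for second in tokens[i + 1 : i + window_size]:
            # Sort each pair so that (a, b) and (b, a) count as the same biterm.
            biterms.append(tuple(sorted((first, second))))
    return biterms


doc = ["deep", "learning", "neural", "networks"]
all_pairs = extract_biterms(doc)                   # every pair in this short document
near_pairs = extract_biterms(doc, window_size=2)   # adjacent words only
print(near_pairs)
# [('deep', 'learning'), ('learning', 'neural'), ('networks', 'neural')]
```

Because the pairs are pooled over the entire corpus, even a four-word document contributes several co-occurrence observations, which is how BTM works around the sparsity of short texts.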

Parameters:

n_topics (int, default=8) – Number of topics to extract from the corpus.
alpha (float, default=None) – Dirichlet prior parameter for topic distribution. Controls topic sparsity in documents. Higher values create more uniform topic distributions. If None, uses 50/n_topics as recommended in the original paper.
beta (float, default=0.01) – Dirichlet prior parameter for word distribution within topics. Controls topic-word sparsity. Lower values create more focused topics.
max_iter (int, default=600) – Maximum number of Gibbs sampling iterations for model training. More iterations generally improve convergence but increase training time.
random_state (int, default=None) – Random seed for reproducible results. Set to an integer for consistent results across runs.
window_size (int, default=15) – Window size for biterm generation. Biterms are extracted from word pairs within this window distance in each document.
has_background (bool, default=False) – Whether to use a background topic to model highly frequent words that appear across many topics (e.g., stop words).
coherence_window (int, default=20) – Number of top words used for coherence calculation. This affects the semantic coherence metric computation.
vectorizer_params (dict, default=None) – Additional parameters to pass to the internal CountVectorizer for text preprocessing. Common options include min_df, max_df, stop_words, etc.
epsilon (float, default=1e-10) – Small numerical constant to prevent division by zero and improve numerical stability in probability calculations.
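The alpha fallback described above is simple arithmetic. The sketch below shows the rule in isolation; resolve_alpha is a hypothetical helper written for illustration and is not part of bitermplus.

```python
def resolve_alpha(alpha, n_topics):
    """Return the explicit alpha, or the documented 50 / n_topics fallback."""
    if alpha is None:
        return 50 / n_topics
    return alpha


default_alpha = resolve_alpha(None, n_topics=8)    # 50 / 8 = 6.25
explicit_alpha = resolve_alpha(0.5, n_topics=8)    # an explicit value is kept as-is
print(default_alpha, explicit_alpha)
```

Note that the fallback shrinks as n_topics grows, so changing the number of topics also changes the effective prior unless alpha is set explicitly.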

Attributes:

model_ (BTM) – The fitted BTM model instance containing learned parameters.
vocabulary_ (numpy.ndarray) – Vocabulary learned from training data (words corresponding to features).
feature_names_out_ (numpy.ndarray) – Alias for vocabulary_ for sklearn compatibility.
n_features_in_ (int) – Number of features (vocabulary size) after preprocessing.
vectorizer_ (CountVectorizer) – The fitted vectorizer used for text preprocessing.

fit(X, y=None): Fit the BTM model to documents.

transform(X, infer_type='sum_b'): Transform documents to topic probability distributions.

fit_transform(X, y=None, infer_type='sum_b'): Fit model and transform documents in one step.

get_topic_words(topic_id=None, n_words=10): Get top words for topics.

get_document_topics(X, threshold=0.1): Get dominant topics for documents.

score(X, y=None): Return mean coherence score across topics.

Examples

Basic usage:

>>> import bitermplus as btm
>>> texts = [
...     "machine learning algorithms are powerful",
...     "deep learning neural networks process data",
...     "natural language processing understands text"
... ]
>>> model = btm.BTMClassifier(n_topics=2, random_state=42)
>>> model.fit(texts)
BTMClassifier(n_topics=2, random_state=42)
>>> doc_topics = model.transform(texts)
>>> print(f"Shape: {doc_topics.shape}")
Shape: (3, 2)

Getting topic words:

>>> topic_words = model.get_topic_words(n_words=5)
>>> for topic_id, words in topic_words.items():
...     print(f"Topic {topic_id}: {', '.join(words)}")

Using with sklearn pipelines:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import FunctionTransformer
>>> pipeline = Pipeline([
...     ('preprocess', FunctionTransformer(lambda x: [s.lower() for s in x])),
...     ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
... ])
>>> topics = pipeline.fit_transform(texts)

References

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web (pp. 1445-1456).

Core Methods 

The BTMClassifier follows the sklearn estimator interface:

fit(X, y=None): Train the BTM model on documents.
transform(X, infer_type='sum_b'): Transform documents to topic probability distributions.
fit_transform(X, y=None, infer_type='sum_b'): Fit model and transform documents in one step.
score(X, y=None): Return mean coherence score across topics.

Parameters 

n_topics (int, default=8): Number of topics to extract.
alpha (float, default=None): Dirichlet prior for topic distribution. If None, uses 50/n_topics.
beta (float, default=0.01): Dirichlet prior for word distribution.
max_iter (int, default=600): Maximum iterations for model training.
random_state (int, default=None): Random seed for reproducible results.
window_size (int, default=15): Window size for biterm generation.
vectorizer_params (dict, default=None): Parameters for the internal CountVectorizer.
epsilon (float, default=1e-10): Small numerical constant to prevent division by zero and improve numerical stability.

Topic Analysis Methods 

get_topic_words(topic_id=None, n_words=10): Get top words for topics. Returns list for single topic or dict for all topics.
get_document_topics(X, threshold=0.1): Get dominant topics for documents above probability threshold.
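The idea behind the probability threshold can be sketched on a single row of a document-topic matrix. This is a hypothetical illustration: the helper topics_above_threshold, its return format, and the choice to treat a probability equal to the threshold as passing are all assumptions, and the actual return type of get_document_topics may differ.

```python
def topics_above_threshold(doc_topic_row, threshold=0.1):
    """Return (topic_id, probability) pairs at or above threshold, most probable first."""
    selected = [
        (topic_id, prob)
        for topic_id, prob in enumerate(doc_topic_row)
        if prob >= threshold
    ]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


row = [0.05, 0.70, 0.25]              # P(topic | document) for three topics
result = topics_above_threshold(row)
print(result)  # [(1, 0.7), (2, 0.25)]
```

Raising the threshold keeps only the strongest assignments; with threshold=0.5 the same row would retain topic 1 alone.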

Properties 

coherence_ (np.ndarray): Topic coherence scores.
perplexity_ (float): Model perplexity (requires transform to be called first).
topic_word_matrix_ (np.ndarray): Topics × words probability matrix.
vocabulary_ (np.ndarray): Learned vocabulary.
n_features_in_ (int): Number of features (vocabulary size).
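For readers unfamiliar with perplexity, the following pure-Python sketch computes the standard definition from a topics × words matrix, a documents × topics matrix, and per-document word counts. It is not the library's code, and whether bitermplus normalizes in exactly this way is an assumption. Since the formula depends on the document-topic probabilities, it also suggests why transform has to run before perplexity_ is available.

```python
import math


def perplexity(topic_word, doc_topic, doc_word_counts):
    """Standard perplexity: exp of the negative mean log-likelihood per word.

    topic_word[z][w]      = P(w | z)
    doc_topic[d][z]       = P(z | d)
    doc_word_counts[d][w] = occurrences of word w in document d
    """
    n_topics = len(topic_word)
    log_likelihood = 0.0
    total_words = 0
    for d, counts in enumerate(doc_word_counts):
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # P(w | d) is a mixture over topics.
            p_w_given_d = sum(doc_topic[d][z] * topic_word[z][w] for z in range(n_topics))
            log_likelihood += n * math.log(p_w_given_d)
            total_words += n
    return math.exp(-log_likelihood / total_words)


# A model that spreads probability uniformly over 4 words has perplexity of about 4.
uniform_topic_word = [[0.25] * 4, [0.25] * 4]
value = perplexity(uniform_topic_word, [[0.5, 0.5]], [[1, 2, 0, 1]])
print(round(value, 6))
```

Intuitively, the value is the effective number of equally likely words the model is choosing between, which is why lower is better.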

Sklearn Integration 

Cross-validation 

from sklearn.model_selection import cross_val_score

model = btm.BTMClassifier(n_topics=5, random_state=42)
scores = cross_val_score(model, texts, cv=3)
print(f"Mean coherence: {scores.mean():.3f}")

Pipeline Integration 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def preprocess_text(texts):
    return [text.lower().replace(',', '') for text in texts]

pipeline = Pipeline([
    ('preprocess', FunctionTransformer(preprocess_text)),
    ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
])

doc_topics = pipeline.fit_transform(texts)

Grid Search 

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_topics': [3, 5, 8],
    'alpha': [0.1, 0.5, 1.0],
    'max_iter': [100, 300]
}

grid_search = GridSearchCV(
    btm.BTMClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring=None  # Uses model's score method
)

grid_search.fit(texts)
best_model = grid_search.best_estimator_

Advanced Usage 

Custom Preprocessing 

Control text preprocessing with vectorizer_params:

custom_params = {
    'min_df': 2,           # Minimum document frequency
    'max_df': 0.8,         # Maximum document frequency
    'stop_words': 'english',  # Remove English stop words
    'lowercase': True,     # Convert to lowercase
    'token_pattern': r'\b[a-zA-Z]{3,}\b'  # Only words 3+ chars
}

model = btm.BTMClassifier(
    n_topics=5,
    vectorizer_params=custom_params
)
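To see what the token_pattern above keeps, the regular expression can be tried on its own with the standard re module. The tokenize helper below is a hypothetical stand-in, not CountVectorizer itself; it lowercases before matching because CountVectorizer applies lowercase before tokenization.

```python
import re

TOKEN_PATTERN = r'\b[a-zA-Z]{3,}\b'   # same pattern as in custom_params


def tokenize(text, lowercase=True):
    """Apply optional lowercasing, then extract tokens matching TOKEN_PATTERN."""
    if lowercase:
        text = text.lower()
    return re.findall(TOKEN_PATTERN, text)


tokens = tokenize("AI is transforming NLP in 2024")
print(tokens)  # ['transforming', 'nlp']
```

Two-letter words such as "AI" and purely numeric tokens such as "2024" are dropped, which matters for short texts where every token counts.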

Inference Types 

Choose different inference methods:

model = btm.BTMClassifier(n_topics=5)
model.fit(texts)

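# Unseen documents to infer topics for (placeholder example)
new_texts = ["neural networks learn from data"]

# Different inference types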
topics_sum_b = model.transform(new_texts, infer_type='sum_b')  # Default
topics_sum_w = model.transform(new_texts, infer_type='sum_w')  # Word-based
topics_mix = model.transform(new_texts, infer_type='mix')      # Mixed
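As background for the default sum_b option, the sketch below follows the document-level inference formula from Yan et al. (2013): P(z|d) is the sum over the document's biterms of P(z|b) P(b|d), where P(z|b) is proportional to P(z) P(w_i|z) P(w_j|z) and P(b|d) is the biterm's relative frequency in the document. The function name infer_sum_b and the argument names theta and phi are hypothetical, and whether bitermplus computes sum_b in exactly this way is an assumption.

```python
from collections import Counter


def infer_sum_b(doc_biterms, theta, phi):
    """Infer P(z | d) by summing topic posteriors over the document's biterms.

    doc_biterms = list of (w_i, w_j) word-id pairs from one document
    theta[z]    = P(z), the corpus-level topic distribution
    phi[z][w]   = P(w | z), the topic-word distribution
    """
    n_topics = len(theta)
    doc_topic = [0.0] * n_topics
    counts = Counter(doc_biterms)
    total = sum(counts.values())
    for (wi, wj), n in counts.items():
        # Unnormalized P(z | b) for each topic.
        joint = [theta[z] * phi[z][wi] * phi[z][wj] for z in range(n_topics)]
        norm = sum(joint)
        for z in range(n_topics):
            # Weight the posterior by P(b | d), the biterm's share of the document.
            doc_topic[z] += (joint[z] / norm) * (n / total)
    return doc_topic


theta = [0.5, 0.5]
phi = [
    [0.45, 0.45, 0.10],   # topic 0 favors words 0 and 1
    [0.10, 0.10, 0.80],   # topic 1 favors word 2
]
p_zd = infer_sum_b([(0, 1)], theta, phi)
print(p_zd)
```

A document whose only biterm pairs words 0 and 1 is assigned mostly to topic 0, and the returned probabilities sum to 1 for any non-empty document.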

Model Evaluation 

model = btm.BTMClassifier(n_topics=5, random_state=42)
model.fit(texts)

# Coherence per topic
coherence_scores = model.coherence_
print(f"Topic coherence: {coherence_scores}")

# Overall model quality
mean_coherence = model.score(texts)
print(f"Mean coherence: {mean_coherence:.3f}")

# Perplexity (lower is better)
model.transform(texts)  # Required for perplexity calculation
perplexity = model.perplexity_
print(f"Perplexity: {perplexity:.3f}")

Working with Pandas 

fit_transform accepts a pandas Series of documents, such as a DataFrame column:

import pandas as pd

df = pd.DataFrame({'text': texts, 'category': ['ML', 'DL', 'NLP', 'AI']})

model = btm.BTMClassifier(n_topics=3)
doc_topics = model.fit_transform(df['text'])

# Add topic predictions to DataFrame
df['dominant_topic'] = doc_topics.argmax(axis=1)
df['topic_confidence'] = doc_topics.max(axis=1)

Tips and Best Practices 

Parameter Selection 

n_topics: Start with 5-10 topics for small datasets, 10-50 for larger ones
alpha: Higher values (1.0+) create more evenly distributed topics
beta: Keep small (0.01-0.1) for focused topics
max_iter: 100-200 iterations are usually sufficient for convergence (the default is 600)
epsilon: The default (1e-10) works well; increase it for more numerical stability, decrease it for higher precision

Performance Optimization 

Use random_state for reproducible results
Set max_iter lower for faster experimentation
Adjust vectorizer_params to control vocabulary size
For large datasets, consider increasing min_df to reduce vocabulary

Topic Quality 

Check coherence scores - higher is generally better
Examine top words per topic for interpretability
Use get_document_topics() to see topic assignments
Compare different n_topics values using coherence

Common Issues 

Import Errors: Make sure Cython extensions are built: python setup.py build_ext --inplace
Empty Topics: Reduce n_topics or adjust vectorizer_params (lower min_df)
Poor Topic Quality: Try different alpha/beta values or increase max_iter
Memory Issues: Increase min_df to reduce vocabulary size for large datasets
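When diagnosing import errors, it helps to capture the underlying message rather than letting the script crash. The snippet below is a small sketch using only the standard library; try_import is a hypothetical helper, and the message printed depends on the local installation.

```python
import importlib


def try_import(name):
    """Return (module, None) on success or (None, error message) on failure."""
    try:
        return importlib.import_module(name), None
    except ImportError as exc:
        return None, str(exc)


module, error = try_import("bitermplus")
if error is not None:
    # A missing compiled extension typically surfaces here as an ImportError.
    print(f"bitermplus import failed: {error}")
```

If the message points at a missing compiled module, rebuilding the extensions as described above is the first thing to try.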

Migration Guide 

Converting from Original API 

Old code:

# Original bitermplus API
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=600)
p_zd = model.transform(docs_vec)

New code:

# New sklearn-style API
model = btm.BTMClassifier(
    n_topics=8,
    random_state=42,
    coherence_window=20,
    alpha=50/8,
    beta=0.01,
    max_iter=600,
    epsilon=1e-10  # Numerical stability parameter
)
p_zd = model.fit_transform(texts)

The new API handles all preprocessing automatically while providing the same underlying functionality with much simpler usage.
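The argument renames can be read off from the two snippets above: T becomes n_topics, seed becomes random_state, M becomes coherence_window, and the iterations argument of fit becomes max_iter. The sketch below collects that mapping in a hypothetical helper, translate_legacy_kwargs, which is not part of bitermplus; the M to coherence_window correspondence is inferred from the matching values in the migration example.

```python
# Old argument name -> new BTMClassifier parameter name (inferred from the examples above).
LEGACY_TO_SKLEARN = {
    "T": "n_topics",
    "seed": "random_state",
    "M": "coherence_window",
    "iterations": "max_iter",
}


def translate_legacy_kwargs(legacy_kwargs):
    """Rename legacy BTM arguments; names without a mapping (alpha, beta) pass through."""
    return {LEGACY_TO_SKLEARN.get(key, key): value for key, value in legacy_kwargs.items()}


legacy = {"seed": 42, "T": 8, "M": 20, "alpha": 50 / 8, "beta": 0.01, "iterations": 600}
new_kwargs = translate_legacy_kwargs(legacy)
print(new_kwargs)
```

The resulting dictionary could then be unpacked into the constructor, for example btm.BTMClassifier(**new_kwargs).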