Sklearn-style API

The bitermplus package includes a scikit-learn-compatible API for topic modeling. The BTMClassifier class follows the estimator interface familiar to scikit-learn users (fit, transform, fit_transform) and can be used as a step in scikit-learn pipelines.

Quick Start

The new API reduces complex topic modeling workflows to just a few lines:

import bitermplus as btm

# Sample documents
texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text",
    "artificial intelligence transforms industries"
]

# Create and fit model (one step!)
model = btm.BTMClassifier(n_topics=2, random_state=42)
model.fit(texts)

# Get topic distributions
doc_topics = model.transform(texts)
print(f"Document-topic matrix shape: {doc_topics.shape}")

# Interpret topics
topic_words = model.get_topic_words(n_words=5)
for topic_id, words in topic_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

API Comparison

Traditional API (complex, multi-step):

# Multiple manual preprocessing steps
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Model creation and fitting
model = btm.BTM(X, vocabulary, seed=42, T=3, M=20, alpha=50/3, beta=0.01)
model.fit(biterms, iterations=100)

# Inference
p_zd = model.transform(docs_vec)

New sklearn API (simple, two lines):

# Everything in one step!
model = btm.BTMClassifier(n_topics=3, random_state=42)
doc_topics = model.fit_transform(texts)

BTMClassifier Class

class bitermplus.BTMClassifier(n_topics: int = 8, alpha: float | None = None, beta: float = 0.01, max_iter: int = 600, random_state: int | None = None, window_size: int = 15, has_background: bool = False, coherence_window: int = 20, vectorizer_params: Dict[str, Any] | None = None, epsilon: float = 1e-10)

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible Biterm Topic Model for short text analysis.

This class provides a scikit-learn compatible interface for the Biterm Topic Model, designed specifically for short text analysis such as tweets, reviews, and messages. Unlike traditional topic models like LDA, BTM extracts biterms (word pairs) from the entire corpus to overcome data sparsity issues in short texts.

The BTMClassifier automatically handles text preprocessing, vectorization, biterm generation, model training, and inference, making topic modeling as simple as calling fit() and transform().
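To make the notion of a biterm concrete, here is a minimal pure-Python sketch of biterm extraction from one tokenized document. The helper name extract_biterms and the exact window rule (positions fewer than window_size apart) are assumptions made for illustration; the library's own implementation may define the window differently.

```python
from itertools import combinations

def extract_biterms(tokens, window_size=15):
    """Return unordered word pairs whose positions are fewer than
    window_size apart (hypothetical helper, not the library's code)."""
    biterms = []
    for i, j in combinations(range(len(tokens)), 2):
        if j - i < window_size:
            # Sort so that (a, b) and (b, a) count as the same biterm
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms

doc = "deep learning neural networks".split()
pairs = extract_biterms(doc)               # every pair fits in the window
adjacent = extract_biterms(doc, window_size=2)  # neighbouring words only
print(len(pairs), len(adjacent))
```

A short text with four tokens already yields six biterms, which is why pooling biterms over the whole corpus gives the model more co-occurrence evidence than per-document word counts.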

Parameters:
  • n_topics (int, default=8) – Number of topics to extract from the corpus.

  • alpha (float, default=None) – Dirichlet prior parameter for topic distribution. Controls topic sparsity in documents. Higher values create more uniform topic distributions. If None, uses 50/n_topics as recommended in the original paper.

  • beta (float, default=0.01) – Dirichlet prior parameter for word distribution within topics. Controls topic-word sparsity. Lower values create more focused topics.

  • max_iter (int, default=600) – Maximum number of Gibbs sampling iterations for model training. More iterations generally improve convergence but increase training time.

  • random_state (int, default=None) – Random seed for reproducible results. Set to an integer for consistent results across runs.

  • window_size (int, default=15) – Window size for biterm generation. Biterms are extracted from word pairs within this window distance in each document.

  • has_background (bool, default=False) – Whether to use a background topic to model highly frequent words that appear across many topics (e.g., stop words).

  • coherence_window (int, default=20) – Number of top words used for coherence calculation. This affects the semantic coherence metric computation.

  • vectorizer_params (dict, default=None) – Additional parameters to pass to the internal CountVectorizer for text preprocessing. Common options include min_df, max_df, stop_words, etc.

  • epsilon (float, default=1e-10) – Small numerical constant to prevent division by zero and improve numerical stability in probability calculations.
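As an illustration of the alpha default described above, the following sketch shows how a None value could be resolved to 50/n_topics. The function resolve_alpha is hypothetical and only mirrors the documented rule; it is not taken from the library source.

```python
def resolve_alpha(n_topics, alpha=None):
    """Return the effective Dirichlet prior for the topic distribution
    (hypothetical helper mirroring the documented default)."""
    if alpha is None:
        # Documented default: 50 / n_topics
        return 50 / n_topics
    return alpha

print(resolve_alpha(8))        # default rule with the default n_topics
print(resolve_alpha(8, 0.5))   # an explicit value is kept as given
```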

model_

The fitted BTM model instance containing learned parameters.

Type:

BTM

vocabulary_

Vocabulary learned from training data (words corresponding to features).

Type:

numpy.ndarray

feature_names_out_

Alias for vocabulary_ for sklearn compatibility.

Type:

numpy.ndarray

n_features_in_

Number of features (vocabulary size) after preprocessing.

Type:

int

vectorizer_

The fitted vectorizer used for text preprocessing.

Type:

CountVectorizer

fit(X, y=None)

Fit the BTM model to documents.

transform(X, infer_type='sum_b')

Transform documents to topic probability distributions.

fit_transform(X, y=None, infer_type='sum_b')

Fit model and transform documents in one step.

get_topic_words(topic_id=None, n_words=10)

Get top words for topics.

get_document_topics(X, threshold=0.1)

Get dominant topics for documents.

score(X, y=None)

Return mean coherence score across topics.

Examples

Basic usage:

>>> import bitermplus as btm
>>> texts = [
...     "machine learning algorithms are powerful",
...     "deep learning neural networks process data",
...     "natural language processing understands text"
... ]
>>> model = btm.BTMClassifier(n_topics=2, random_state=42)
>>> model.fit(texts)
BTMClassifier(n_topics=2, random_state=42)
>>> doc_topics = model.transform(texts)
>>> print(f"Shape: {doc_topics.shape}")
Shape: (3, 2)

Getting topic words:

>>> topic_words = model.get_topic_words(n_words=5)
>>> for topic_id, words in topic_words.items():
...     print(f"Topic {topic_id}: {', '.join(words)}")

Using with sklearn pipelines:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import FunctionTransformer
>>> pipeline = Pipeline([
...     ('preprocess', FunctionTransformer(lambda x: [s.lower() for s in x])),
...     ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
... ])
>>> topics = pipeline.fit_transform(texts)

References

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web (pp. 1445-1456).

See also

BTM

Low-level BTM implementation

get_words_freqs

Extract word frequencies from documents

get_biterms

Generate biterms from vectorized documents

__init__(n_topics: int = 8, alpha: float | None = None, beta: float = 0.01, max_iter: int = 600, random_state: int | None = None, window_size: int = 15, has_background: bool = False, coherence_window: int = 20, vectorizer_params: Dict[str, Any] | None = None, epsilon: float = 1e-10)

property coherence_: ndarray

Topic coherence scores.

fit(X: List[str] | Series, y=None, verbose: bool = False)

Fit the BTM model to documents.

Parameters:
  • X (array-like of shape (n_documents,)) – Documents to fit the model on. Each element should be a string.

  • y (Ignored) – Not used, present for sklearn compatibility.

  • verbose (bool, default=False) – Whether to show a progress bar during training.

Returns:

self – Returns the instance itself.

Return type:

BTMClassifier

fit_transform(X: List[str] | Series, y=None, infer_type: str = 'sum_b', verbose: bool = False) → ndarray

Fit model and transform documents in one step.

Parameters:
  • X (array-like of shape (n_documents,)) – Documents to fit and transform.

  • y (Ignored) – Not used, present for sklearn compatibility.

  • infer_type (str, default='sum_b') – Inference method. Options: ‘sum_b’, ‘sum_w’, ‘mix’.

  • verbose (bool, default=False) – Whether to show a progress bar during training.

Returns:

doc_topic_matrix – Document-topic probability matrix.

Return type:

np.ndarray of shape (n_documents, n_topics)

get_document_topics(X: List[str] | Series, threshold: float = 0.1) → List[List[int]]

Get dominant topics for documents.

Parameters:
  • X (array-like of shape (n_documents,)) – Documents to analyze.

  • threshold (float, default=0.1) – Minimum probability threshold for topic assignment.

Returns:

doc_topics – For each document, list of topic IDs above threshold.

Return type:

list of list of int
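The thresholding described above can be sketched in plain Python as follows. The helper topics_above_threshold is hypothetical, and whether the library compares with >= or > at the boundary is an assumption here.

```python
def topics_above_threshold(doc_topic_matrix, threshold=0.1):
    """For each document row, list the topic IDs whose probability
    reaches the threshold (hypothetical helper; >= is an assumption)."""
    return [
        [topic_id for topic_id, prob in enumerate(row) if prob >= threshold]
        for row in doc_topic_matrix
    ]

# Illustrative document-topic probabilities (rows sum to 1)
matrix = [
    [0.85, 0.12, 0.03],
    [0.05, 0.05, 0.90],
]
assigned = topics_above_threshold(matrix, threshold=0.1)
print(assigned)
```

A document can therefore receive several topic IDs, or none at all if the threshold is set above its largest probability.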

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

get_topic_words(topic_id: int | None = None, n_words: int = 10) → List[str] | Dict[int, List[str]]

Get top words for topics.

Parameters:
  • topic_id (int, optional) – If provided, return words for this topic only. If None, return words for all topics.

  • n_words (int, default=10) – Number of top words to return per topic.

Returns:

topic_words – If topic_id is provided, returns list of top words for that topic. Otherwise, returns dict mapping topic_id to list of words.

Return type:

list or dict
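As a rough sketch of what selecting top words involves, the snippet below ranks vocabulary entries by their probability in one row of a topic-word matrix. The helper top_words and the sample numbers are illustrative assumptions, not the library's implementation.

```python
def top_words(topic_word_row, vocabulary, n_words=10):
    """Return the n_words vocabulary entries with the highest probability
    in one topic's row (hypothetical helper)."""
    ranked = sorted(
        range(len(vocabulary)),
        key=lambda i: topic_word_row[i],
        reverse=True,
    )
    return [vocabulary[i] for i in ranked[:n_words]]

vocabulary = ["data", "learning", "text", "network"]
row = [0.10, 0.50, 0.05, 0.35]  # made-up P(word | topic) values
print(top_words(row, vocabulary, n_words=2))
```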

property perplexity_: float

Model perplexity.

score(X: List[str] | Series, y=None) → float

Return the mean coherence score.

Parameters:
  • X (array-like of shape (n_documents,)) – Documents to score.

  • y (Ignored) – Not used, present for sklearn compatibility.

Returns:

score – Mean coherence score across topics.

Return type:

float

set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') → BTMClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_output(*, transform=None)

Set output container.

See the scikit-learn example gallery entry on the set_output API (plot_set_output.py, under the miscellaneous examples) for an example of how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, infer_type: bool | None | str = '$UNCHANGED$') → BTMClassifier

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

infer_type (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for infer_type parameter in transform.

Returns:

self – The updated object.

Return type:

object

property topic_word_matrix_: ndarray

Topic-word probability matrix.

transform(X: List[str] | Series, infer_type: str = 'sum_b') → ndarray

Transform documents to topic distribution.

Parameters:
  • X (array-like of shape (n_documents,)) – Documents to transform.

  • infer_type (str, default='sum_b') – Inference method. Options: ‘sum_b’, ‘sum_w’, ‘mix’.

Returns:

doc_topic_matrix – Document-topic probability matrix.

Return type:

np.ndarray of shape (n_documents, n_topics)

Core Methods

The BTMClassifier follows the sklearn estimator interface:

fit(X, y=None)

Train the BTM model on documents.

transform(X, infer_type='sum_b')

Transform documents to topic probability distributions.

fit_transform(X, y=None, infer_type='sum_b')

Fit model and transform documents in one step.

score(X, y=None)

Return mean coherence score across topics.

Parameters

n_topics (int, default=8)

Number of topics to extract.

alpha (float, default=None)

Dirichlet prior for topic distribution. If None, uses 50/n_topics.

beta (float, default=0.01)

Dirichlet prior for word distribution.

max_iter (int, default=600)

Maximum iterations for model training.

random_state (int, default=None)

Random seed for reproducible results.

window_size (int, default=15)

Window size for biterm generation.

vectorizer_params (dict, default=None)

Parameters for the internal CountVectorizer.

epsilon (float, default=1e-10)

Small numerical constant to prevent division by zero and improve numerical stability.

Topic Analysis Methods

get_topic_words(topic_id=None, n_words=10)

Get top words for topics. Returns list for single topic or dict for all topics.

get_document_topics(X, threshold=0.1)

Get dominant topics for documents above probability threshold.

Properties

coherence_ (np.ndarray)

Topic coherence scores.

perplexity_ (float)

Model perplexity (requires transform to be called first).

topic_word_matrix_ (np.ndarray)

Topics × words probability matrix.

vocabulary_ (np.ndarray)

Learned vocabulary.

n_features_in_ (int)

Number of features (vocabulary size).

Sklearn Integration

Cross-validation

from sklearn.model_selection import cross_val_score

model = btm.BTMClassifier(n_topics=5, random_state=42)
scores = cross_val_score(model, texts, cv=3)
print(f"Mean coherence: {scores.mean():.3f}")

Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def preprocess_text(texts):
    return [text.lower().replace(',', '') for text in texts]

pipeline = Pipeline([
    ('preprocess', FunctionTransformer(preprocess_text)),
    ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
])

doc_topics = pipeline.fit_transform(texts)

Advanced Usage

Custom Preprocessing

Control text preprocessing with vectorizer_params:

custom_params = {
    'min_df': 2,           # Minimum document frequency
    'max_df': 0.8,         # Maximum document frequency
    'stop_words': 'english',  # Remove English stop words
    'lowercase': True,     # Convert to lowercase
    'token_pattern': r'\b[a-zA-Z]{3,}\b'  # Only words 3+ chars
}

model = btm.BTMClassifier(
    n_topics=5,
    vectorizer_params=custom_params
)
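To see what the token_pattern above keeps, the sketch below applies the same regular expression with Python's re module, which is how CountVectorizer tokenizes by default (it calls findall on the compiled pattern). The tokenize helper and the sample sentence are illustrative; stop-word removal and document-frequency filtering are separate steps not shown here.

```python
import re

# Same pattern as in custom_params: alphabetic tokens of 3+ characters
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")

def tokenize(text, lowercase=True):
    """Lowercase (optionally) and extract tokens matching the pattern
    (hypothetical helper for illustration)."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("AI is a big deal in NLP, circa 2024")
print(tokens)  # short words and digits are dropped
```

Note that two-letter terms such as "AI" are discarded by this pattern, which may matter for short technical texts.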

Inference Types

Choose different inference methods:

model = btm.BTMClassifier(n_topics=5)
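model.fit(texts)

# Unseen documents to infer topics for (illustrative)
new_texts = ["neural networks learn from data", "language models process text"]

# Different inference types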
topics_sum_b = model.transform(new_texts, infer_type='sum_b')  # Default
topics_sum_w = model.transform(new_texts, infer_type='sum_w')  # Word-based
topics_mix = model.transform(new_texts, infer_type='mix')      # Mixed
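The page does not spell out what each inference type computes. Assuming 'sum_b' corresponds to the biterm-sum inference of Yan et al. (2013), where P(z|d) = Σ_b P(z|b) P(b|d) and P(z|b) is proportional to θ_z · φ(w_i|z) · φ(w_j|z), a minimal sketch looks like this. The function name, argument layout, and toy numbers are assumptions, not the library's code.

```python
def infer_sum_b(doc_biterms, theta, phi):
    """Estimate P(z|d) by averaging P(z|b) over the biterm occurrences
    of one document (sketch of the paper's formula, hypothetical helper).

    doc_biterms: list of (w_i, w_j) word-index pairs from the document
    theta:       list of K topic probabilities P(z)
    phi:         K x W nested list of word probabilities P(w|z)
    """
    n_topics = len(theta)
    doc_topics = [0.0] * n_topics
    for w_i, w_j in doc_biterms:
        # Unnormalized P(z|b) for every topic
        weights = [theta[z] * phi[z][w_i] * phi[z][w_j] for z in range(n_topics)]
        total = sum(weights)
        for z in range(n_topics):
            doc_topics[z] += weights[z] / total
    # Each occurrence contributes equally, i.e. P(b|d) is its relative frequency
    return [p / len(doc_biterms) for p in doc_topics]

theta = [0.5, 0.5]
phi = [[0.9, 0.1], [0.1, 0.9]]  # topic 0 favours word 0, topic 1 favours word 1
p_zd = infer_sum_b([(0, 0)], theta, phi)
print(p_zd)
```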

Model Evaluation

model = btm.BTMClassifier(n_topics=5, random_state=42)
model.fit(texts)

# Coherence per topic
coherence_scores = model.coherence_
print(f"Topic coherence: {coherence_scores}")

# Overall model quality
mean_coherence = model.score(texts)
print(f"Mean coherence: {mean_coherence:.3f}")

# Perplexity (lower is better)
model.transform(texts)  # Required for perplexity calculation
perplexity = model.perplexity_
print(f"Perplexity: {perplexity:.3f}")
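For readers unfamiliar with the metric, the sketch below computes perplexity in its standard form: the exponential of the negative average log-likelihood per token, with P(w|d) = Σ_z P(w|z) P(z|d). This is the generic definition, assumed here for illustration; the function and argument names are hypothetical and the library's own computation may differ in detail.

```python
import math

def perplexity(doc_word_counts, p_zd, p_wz):
    """Generic perplexity from word counts, document-topic probabilities
    (D x K) and topic-word probabilities (K x W). Hypothetical helper."""
    log_likelihood = 0.0
    n_tokens = 0
    for d, counts in enumerate(doc_word_counts):
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # P(w|d) marginalizes over topics
            p_w = sum(p_zd[d][z] * p_wz[z][w] for z in range(len(p_wz)))
            log_likelihood += n * math.log(p_w)
            n_tokens += n
    return math.exp(-log_likelihood / n_tokens)

# One topic that is uniform over a 4-word vocabulary
value = perplexity([[1, 2, 0, 1]], p_zd=[[1.0]], p_wz=[[0.25] * 4])
print(value)
```

A model that is uniform over W words has perplexity W, which gives a useful upper reference point when reading the reported value.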

Working with Pandas

The API accepts a pandas Series of documents, such as a DataFrame column:

import pandas as pd

df = pd.DataFrame({'text': texts, 'category': ['ML', 'DL', 'NLP', 'AI']})

model = btm.BTMClassifier(n_topics=3)
doc_topics = model.fit_transform(df['text'])

# Add topic predictions to DataFrame
df['dominant_topic'] = doc_topics.argmax(axis=1)
df['topic_confidence'] = doc_topics.max(axis=1)

Tips and Best Practices

Parameter Selection

  • n_topics: Start with 5-10 topics for small datasets, 10-50 for larger ones

  • alpha: Higher values (1.0+) create more evenly distributed topics

  • beta: Keep small (0.01-0.1) for focused topics

  • max_iter: 100-200 usually sufficient for convergence

  • epsilon: Default (1e-10) works well; increase for extreme numerical stability, decrease for higher precision

Performance Optimization

  • Use random_state for reproducible results

  • Set max_iter lower for faster experimentation

  • Adjust vectorizer_params to control vocabulary size

  • For large datasets, consider increasing min_df to reduce vocabulary
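To illustrate how min_df and max_df shrink the vocabulary, here is a pure-Python sketch of document-frequency filtering that follows CountVectorizer's convention: integers are absolute document counts, floats are proportions of the corpus. The helper filter_vocabulary and the toy corpus are assumptions for illustration.

```python
def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]
    (hypothetical helper mimicking CountVectorizer's thresholds)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Integers are absolute counts, floats are fractions of the corpus
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, c in doc_freq.items() if low <= c <= high)

docs = [
    ["topic", "model", "text"],
    ["topic", "model"],
    ["topic", "short"],
    ["topic", "text"],
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.8)
print(kept)
```

Here "topic" appears in every document and "short" in only one, so both are removed; fewer vocabulary terms means fewer distinct biterms to process.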

Topic Quality

  • Check coherence scores - higher is generally better

  • Examine top words per topic for interpretability

  • Use get_document_topics() to see topic assignments

  • Compare different n_topics values using coherence
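Comparing n_topics values by coherence can be organised as a small search loop. The sketch below is library-agnostic: best_n_topics and fake_score are hypothetical names, and the stand-in scorer exists only so the example runs without bitermplus installed.

```python
def best_n_topics(candidates, fit_and_score):
    """Score each candidate topic count and return the best one together
    with all scores (hypothetical helper)."""
    scores = {k: fit_and_score(k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# With bitermplus available, fit_and_score could be built from the API
# documented on this page, for example:
#   lambda k: btm.BTMClassifier(n_topics=k, random_state=42).fit(texts).score(texts)

def fake_score(k):
    # Stand-in scorer that peaks at 10 topics; not a real coherence value
    return -abs(k - 10)

best, scores = best_n_topics([5, 10, 20], fake_score)
print(best, scores)
```

Fixing random_state across candidates keeps the comparison from being dominated by sampling noise.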

Common Issues

Import Errors

Make sure Cython extensions are built: python setup.py build_ext --inplace

Empty Topics

Reduce n_topics or adjust vectorizer_params (lower min_df)

Poor Topic Quality

Try different alpha/beta values or increase max_iter

Memory Issues

Increase min_df to reduce vocabulary size for large datasets

Migration Guide

Converting from Original API

Old code:

# Original bitermplus API
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=600)
p_zd = model.transform(docs_vec)

New code:

# New sklearn-style API
model = btm.BTMClassifier(
    n_topics=8,
    random_state=42,
    coherence_window=20,
    alpha=50/8,
    beta=0.01,
    max_iter=600,
    epsilon=1e-10  # Numerical stability parameter
)
p_zd = model.fit_transform(texts)

The new API handles all preprocessing automatically while providing the same underlying functionality with much simpler usage.