Sklearn-style API
The bitermplus package now includes a scikit-learn-compatible API that reduces topic modeling to the familiar fit/transform workflow. The BTMClassifier class follows the scikit-learn estimator interface and can be used as a step in ML pipelines.
Quick Start
The new API reduces complex topic modeling workflows to just a few lines:
import bitermplus as btm
# Sample documents
texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text",
    "artificial intelligence transforms industries"
]
# Create and fit model (one step!)
model = btm.BTMClassifier(n_topics=2, random_state=42)
model.fit(texts)
# Get topic distributions
doc_topics = model.transform(texts)
print(f"Document-topic matrix shape: {doc_topics.shape}")
# Interpret topics
topic_words = model.get_topic_words(n_words=5)
for topic_id, words in topic_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")
API Comparison
Traditional API (complex, multi-step):
# Multiple manual preprocessing steps
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
# Model creation and fitting
model = btm.BTM(X, vocabulary, seed=42, T=3, M=20, alpha=50/3, beta=0.01)
model.fit(biterms, iterations=100)
# Inference
p_zd = model.transform(docs_vec)
New Sklearn API (simple, two lines):
# Everything in one step!
model = btm.BTMClassifier(n_topics=3, random_state=42)
doc_topics = model.fit_transform(texts)
BTMClassifier Class
- class bitermplus.BTMClassifier(n_topics: int = 8, alpha: float | None = None, beta: float = 0.01, max_iter: int = 600, random_state: int | None = None, window_size: int = 15, has_background: bool = False, coherence_window: int = 20, vectorizer_params: Dict[str, Any] | None = None, epsilon: float = 1e-10)[source]
Bases: BaseEstimator, TransformerMixin
Sklearn-compatible Biterm Topic Model for short text analysis.
This class provides a scikit-learn compatible interface for the Biterm Topic Model, designed specifically for short text analysis such as tweets, reviews, and messages. Unlike traditional topic models like LDA, BTM extracts biterms (word pairs) from the entire corpus to overcome data sparsity issues in short texts.
The BTMClassifier automatically handles text preprocessing, vectorization, biterm generation, model training, and inference, making topic modeling as simple as calling fit() and transform().
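To make the idea of biterms concrete, here is a minimal pure-Python sketch of window-based biterm extraction. It is an illustration only, not the package's implementation: the helper name `extract_biterms` and the exact window semantics (each word is paired with the words that follow it inside the window) are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def extract_biterms(tokens: List[str], window_size: int = 15) -> List[Tuple[str, str]]:
    """Return unordered word pairs that co-occur within `window_size` positions.

    Hypothetical helper: pairs are sorted so that (a, b) and (b, a) count as one biterm.
    """
    biterms = []
    for i in range(len(tokens) - 1):
        # Pair token i with the tokens that follow it inside the window.
        for j in range(i + 1, min(i + window_size, len(tokens))):
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


docs = ["deep learning neural networks", "machine learning algorithms"]

# Biterms are pooled over the whole corpus rather than kept per document.
corpus_biterms = Counter()
for doc in docs:
    corpus_biterms.update(extract_biterms(doc.split()))

print(sum(corpus_biterms.values()))
```

Pooling the pairs over the corpus is what lets the model learn from word co-occurrence even when each individual document contains only a handful of words.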
- Parameters:
n_topics (int, default=8) – Number of topics to extract from the corpus.
alpha (float, default=None) – Dirichlet prior parameter for topic distribution. Controls topic sparsity in documents. Higher values create more uniform topic distributions. If None, uses 50/n_topics as recommended in the original paper.
beta (float, default=0.01) – Dirichlet prior parameter for word distribution within topics. Controls topic-word sparsity. Lower values create more focused topics.
max_iter (int, default=600) – Maximum number of Gibbs sampling iterations for model training. More iterations generally improve convergence but increase training time.
random_state (int, default=None) – Random seed for reproducible results. Set to an integer for consistent results across runs.
window_size (int, default=15) – Window size for biterm generation. Biterms are extracted from word pairs within this window distance in each document.
has_background (bool, default=False) – Whether to use a background topic to model highly frequent words that appear across many topics (e.g., stop words).
coherence_window (int, default=20) – Number of top words used for coherence calculation. This affects the semantic coherence metric computation.
vectorizer_params (dict, default=None) – Additional parameters to pass to the internal CountVectorizer for text preprocessing. Common options include min_df, max_df, stop_words, etc.
epsilon (float, default=1e-10) – Small numerical constant to prevent division by zero and improve numerical stability in probability calculations.
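The effect of alpha can be illustrated with a small sketch. The function names below are hypothetical, and the smoothing formula is the standard symmetric-Dirichlet estimate (n_z + alpha) / (N + K * alpha), used here as an assumption about how the prior enters the topic proportions rather than as the package's exact code.

```python
from typing import List, Optional


def resolve_alpha(n_topics: int, alpha: Optional[float] = None) -> float:
    """Return alpha, falling back to 50 / n_topics when alpha is None."""
    return 50.0 / n_topics if alpha is None else alpha


def smoothed_topic_proportions(biterm_counts: List[int], alpha: float) -> List[float]:
    """Dirichlet-smoothed topic proportions: (n_z + alpha) / (N + K * alpha)."""
    total = sum(biterm_counts) + len(biterm_counts) * alpha
    return [(n_z + alpha) / total for n_z in biterm_counts]


counts = [90, 10]  # hypothetical numbers of biterms assigned to two topics

sparse = smoothed_topic_proportions(counts, alpha=0.1)
smooth = smoothed_topic_proportions(counts, alpha=resolve_alpha(n_topics=2))

print(sparse)  # stays close to the raw 90/10 split
print(smooth)  # pulled toward a uniform split by the larger alpha
```

With the same counts, the larger alpha moves probability mass toward the smaller topic, which is the "more uniform" behavior described for the alpha parameter.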
- vocabulary_
Vocabulary learned from training data (words corresponding to features).
- Type:
numpy.ndarray
- feature_names_out_
Alias for vocabulary_ for sklearn compatibility.
- Type:
numpy.ndarray
- n_features_in_
Number of features (vocabulary size) after preprocessing.
- Type:
int
- vectorizer_
The fitted vectorizer used for text preprocessing.
- Type:
CountVectorizer
- fit_transform(X, y=None, infer_type='sum_b')[source]
Fit model and transform documents in one step.
Examples
Basic usage:
>>> import bitermplus as btm
>>> texts = [
...     "machine learning algorithms are powerful",
...     "deep learning neural networks process data",
...     "natural language processing understands text"
... ]
>>> model = btm.BTMClassifier(n_topics=2, random_state=42)
>>> model.fit(texts)
BTMClassifier(n_topics=2, random_state=42)
>>> doc_topics = model.transform(texts)
>>> print(f"Shape: {doc_topics.shape}")
Shape: (3, 2)
Getting topic words:
>>> topic_words = model.get_topic_words(n_words=5)
>>> for topic_id, words in topic_words.items():
...     print(f"Topic {topic_id}: {', '.join(words)}")
Using with sklearn pipelines:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import FunctionTransformer
>>> pipeline = Pipeline([
...     ('preprocess', FunctionTransformer(lambda x: [s.lower() for s in x])),
...     ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
... ])
>>> topics = pipeline.fit_transform(texts)
References
Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web (pp. 1445-1456).
See also
BTM: Low-level BTM implementation
get_words_freqs: Extract word frequencies from documents
get_biterms: Generate biterms from vectorized documents
- __init__(n_topics: int = 8, alpha: float | None = None, beta: float = 0.01, max_iter: int = 600, random_state: int | None = None, window_size: int = 15, has_background: bool = False, coherence_window: int = 20, vectorizer_params: Dict[str, Any] | None = None, epsilon: float = 1e-10)[source]
- property coherence_: ndarray
Topic coherence scores.
- fit(X: List[str] | Series, y=None, verbose: bool = False)[source]
Fit the BTM model to documents.
- Parameters:
X (array-like of shape (n_documents,)) – Documents to fit the model on. Each element should be a string.
y (Ignored) – Not used, present for sklearn compatibility.
verbose (bool, default=False) – Whether to show a progress bar during training.
- Returns:
self – Returns the instance itself.
- Return type:
BTMClassifier
- fit_transform(X: List[str] | Series, y=None, infer_type: str = 'sum_b', verbose: bool = False) ndarray[source]
Fit model and transform documents in one step.
- Parameters:
X (array-like of shape (n_documents,)) – Documents to fit and transform.
y (Ignored) – Not used, present for sklearn compatibility.
infer_type (str, default='sum_b') – Inference method. Options: 'sum_b', 'sum_w', 'mix'.
verbose (bool, default=False) – Whether to show a progress bar during training.
- Returns:
doc_topic_matrix – Document-topic probability matrix.
- Return type:
np.ndarray of shape (n_documents, n_topics)
- get_document_topics(X: List[str] | Series, threshold: float = 0.1) List[List[int]][source]
Get dominant topics for documents.
- Parameters:
X (array-like of shape (n_documents,)) – Documents to analyze.
threshold (float, default=0.1) – Minimum probability threshold for topic assignment.
- Returns:
doc_topics – For each document, list of topic IDs above threshold.
- Return type:
list of list of int
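The thresholding described above can be sketched in plain Python. This is an assumption about the behavior, not the package's code: the helper name is hypothetical, and whether the comparison is strict (>) or inclusive (>=) is not stated in the documentation.

```python
from typing import List, Sequence


def topics_above_threshold(
    doc_topic_matrix: Sequence[Sequence[float]], threshold: float = 0.1
) -> List[List[int]]:
    """For each document, list the topic IDs whose probability reaches the threshold."""
    return [
        [topic_id for topic_id, prob in enumerate(row) if prob >= threshold]
        for row in doc_topic_matrix
    ]


# Hypothetical document-topic probabilities for three documents and three topics.
doc_topic_matrix = [
    [0.85, 0.05, 0.10 + 0.00],
    [0.30, 0.65, 0.05],
    [0.02, 0.03, 0.95],
]

print(topics_above_threshold(doc_topic_matrix, threshold=0.2))
```

A document can be assigned several topics, or none at all if the threshold is set higher than its largest probability.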
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- get_topic_words(topic_id: int | None = None, n_words: int = 10) List[str] | Dict[int, List[str]][source]
Get top words for topics.
- Parameters:
topic_id (int, optional) – If provided, return words for this topic only. If None, return words for all topics.
n_words (int, default=10) – Number of top words to return per topic.
- Returns:
topic_words – If topic_id is provided, returns list of top words for that topic. Otherwise, returns dict mapping topic_id to list of words.
- Return type:
list or dict
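As a rough illustration of what "top words" means, the sketch below ranks the columns of a topic-word probability matrix and maps them back to vocabulary entries. The helper name and the toy data are hypothetical; only the idea of sorting by probability is taken from the description above.

```python
from typing import Dict, List, Sequence


def top_words(
    topic_word_matrix: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    n_words: int = 10,
) -> Dict[int, List[str]]:
    """Map each topic ID to its `n_words` highest-probability vocabulary entries."""
    result = {}
    for topic_id, row in enumerate(topic_word_matrix):
        # Word indices sorted by descending probability within this topic.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result[topic_id] = [vocabulary[i] for i in ranked[:n_words]]
    return result


vocabulary = ["data", "learning", "language", "text"]
topic_word_matrix = [
    [0.40, 0.50, 0.05, 0.05],  # topic 0
    [0.10, 0.10, 0.45, 0.35],  # topic 1
]

print(top_words(topic_word_matrix, vocabulary, n_words=2))
```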
- property perplexity_: float
Model perplexity.
- score(X: List[str] | Series, y=None) float[source]
Return the mean coherence score.
- Parameters:
X (array-like of shape (n_documents,)) – Documents to score.
y (Ignored) – Not used, present for sklearn compatibility.
- Returns:
score – Mean coherence score across topics.
- Return type:
float
- set_fit_request(*, verbose: bool | None | str = '$UNCHANGED$') BTMClassifier
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_output(*, transform=None)
Set output container.
See the scikit-learn example "Introducing the set_output API" for an example of how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: “polars” option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_transform_request(*, infer_type: bool | None | str = '$UNCHANGED$') BTMClassifier
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
infer_type (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for infer_type parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- property topic_word_matrix_: ndarray
Topic-word probability matrix.
- transform(X: List[str] | Series, infer_type: str = 'sum_b') ndarray[source]
Transform documents to topic distribution.
- Parameters:
X (array-like of shape (n_documents,)) – Documents to transform.
infer_type (str, default='sum_b') – Inference method. Options: 'sum_b', 'sum_w', 'mix'.
- Returns:
doc_topic_matrix – Document-topic probability matrix.
- Return type:
np.ndarray of shape (n_documents, n_topics)
Core Methods
The BTMClassifier follows the sklearn estimator interface:
- fit(X, y=None)
Train the BTM model on documents.
- transform(X, infer_type='sum_b')
Transform documents to topic probability distributions.
- fit_transform(X, y=None, infer_type='sum_b')
Fit model and transform documents in one step.
- score(X, y=None)
Return mean coherence score across topics.
Parameters
- n_topics: int, default=8
Number of topics to extract.
- alpha: float, default=None
Dirichlet prior for topic distribution. If None, uses 50/n_topics.
- beta: float, default=0.01
Dirichlet prior for word distribution.
- max_iter: int, default=600
Maximum iterations for model training.
- random_state: int, default=None
Random seed for reproducible results.
- window_size: int, default=15
Window size for biterm generation.
- vectorizer_params: dict, default=None
Parameters for the internal CountVectorizer.
- epsilon: float, default=1e-10
Small numerical constant to prevent division by zero and improve numerical stability.
Topic Analysis Methods
- get_topic_words(topic_id=None, n_words=10)
Get top words for topics. Returns list for single topic or dict for all topics.
- get_document_topics(X, threshold=0.1)
Get dominant topics for documents above probability threshold.
Properties
- coherence_: np.ndarray
Topic coherence scores.
- perplexity_: float
Model perplexity (requires transform to be called first).
- topic_word_matrix_: np.ndarray
Topics × words probability matrix.
- vocabulary_: np.ndarray
Learned vocabulary.
- n_features_in_: int
Number of features (vocabulary size).
Sklearn Integration
Cross-validation
from sklearn.model_selection import cross_val_score
model = btm.BTMClassifier(n_topics=5, random_state=42)
scores = cross_val_score(model, texts, cv=3)
print(f"Mean coherence: {scores.mean():.3f}")
Pipeline Integration
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
def preprocess_text(texts):
    return [text.lower().replace(',', '') for text in texts]
pipeline = Pipeline([
    ('preprocess', FunctionTransformer(preprocess_text)),
    ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
])
doc_topics = pipeline.fit_transform(texts)
Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_topics': [3, 5, 8],
    'alpha': [0.1, 0.5, 1.0],
    'max_iter': [100, 300]
}
grid_search = GridSearchCV(
    btm.BTMClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring=None  # Uses model's score method
)
grid_search.fit(texts)
best_model = grid_search.best_estimator_
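Before launching a grid search it can help to estimate its cost. The sketch below enumerates the same grid with the standard library only; the variable names are illustrative, and the count assumes the cv=3 setting shown above.

```python
from itertools import product

param_grid = {
    "n_topics": [3, 5, 8],
    "alpha": [0.1, 0.5, 1.0],
    "max_iter": [100, 300],
}
cv_folds = 3

# Every combination of parameter values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold.
n_cv_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates, {n_cv_fits} cross-validation fits")
```

GridSearchCV additionally refits the best candidate on the full data by default (refit=True), so one more training run follows the cross-validation fits. Because BTM training uses Gibbs sampling, keeping max_iter small in the grid shortens the search considerably.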
Advanced Usage
Custom Preprocessing
Control text preprocessing with vectorizer_params:
custom_params = {
    'min_df': 2,  # Minimum document frequency
    'max_df': 0.8,  # Maximum document frequency
    'stop_words': 'english',  # Remove English stop words
    'lowercase': True,  # Convert to lowercase
    'token_pattern': r'\b[a-zA-Z]{3,}\b'  # Only words 3+ chars
}
model = btm.BTMClassifier(
    n_topics=5,
    vectorizer_params=custom_params
)
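To check what a token_pattern keeps before passing it to the vectorizer, the regular expression can be tried directly with the standard library. The `tokenize` helper below is a hypothetical stand-in that only mimics lowercasing followed by pattern matching; it does not apply stop-word or document-frequency filtering.

```python
import re

token_pattern = r"\b[a-zA-Z]{3,}\b"  # same pattern as in custom_params


def tokenize(text: str) -> list:
    """Lowercase the text, then keep alphabetic tokens of three or more characters."""
    return re.findall(token_pattern, text.lower())


tokens = tokenize("AI is transforming NLP in 2024")
print(tokens)
```

Two-letter words such as "ai", "is", and "in" are dropped along with the number, so short acronyms that matter for a corpus may call for a looser pattern.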
Inference Types
Choose different inference methods:
model = btm.BTMClassifier(n_topics=5)
model.fit(texts)
# Documents to run inference on (the training texts are reused here as a placeholder)
new_texts = texts
# Different inference types
topics_sum_b = model.transform(new_texts, infer_type='sum_b')  # Default
topics_sum_w = model.transform(new_texts, infer_type='sum_w')  # Word-based
topics_mix = model.transform(new_texts, infer_type='mix')  # Mixed
Model Evaluation
model = btm.BTMClassifier(n_topics=5, random_state=42)
model.fit(texts)
# Coherence per topic
coherence_scores = model.coherence_
print(f"Topic coherence: {coherence_scores}")
# Overall model quality
mean_coherence = model.score(texts)
print(f"Mean coherence: {mean_coherence:.3f}")
# Perplexity (lower is better)
model.transform(texts) # Required for perplexity calculation
perplexity = model.perplexity_
print(f"Perplexity: {perplexity:.3f}")
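Perplexity depends on the document-topic matrix, which is why transform is called first. The sketch below uses the standard definition, exp(-(sum of n_dw * log p(w|d)) / (total word count)) with p(w|d) = sum over topics of p(w|z) * p(z|d). It is an assumption that the package follows this definition exactly; the function and variable names are illustrative.

```python
import math
from typing import Sequence


def perplexity(
    p_wz: Sequence[Sequence[float]],          # topics x words: p(w | z)
    p_zd: Sequence[Sequence[float]],          # documents x topics: p(z | d)
    doc_word_counts: Sequence[Sequence[int]],  # documents x words: n_dw
) -> float:
    """Perplexity of a topic model under the standard definition."""
    log_likelihood = 0.0
    n_tokens = 0
    for d, counts in enumerate(doc_word_counts):
        for w, n_dw in enumerate(counts):
            if n_dw == 0:
                continue  # words absent from the document contribute nothing
            # Marginalize over topics to get p(w | d).
            p_wd = sum(p_zd[d][z] * p_wz[z][w] for z in range(len(p_wz)))
            log_likelihood += n_dw * math.log(p_wd)
            n_tokens += n_dw
    return math.exp(-log_likelihood / n_tokens)


# Toy model: two topics that both spread probability evenly over two words.
p_wz = [[0.5, 0.5], [0.5, 0.5]]
p_zd = [[1.0, 0.0], [0.3, 0.7]]
doc_word_counts = [[2, 1], [0, 3]]

print(perplexity(p_wz, p_zd, doc_word_counts))
```

In this toy case every word has probability 0.5 in every document, so the model is as uncertain as a fair coin flip over two words and the perplexity equals 2.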
Working with Pandas
The API accepts a pandas Series of documents, such as a DataFrame column:
import pandas as pd
df = pd.DataFrame({'text': texts, 'category': ['ML', 'DL', 'NLP', 'AI']})
model = btm.BTMClassifier(n_topics=3)
doc_topics = model.fit_transform(df['text'])
# Add topic predictions to DataFrame
df['dominant_topic'] = doc_topics.argmax(axis=1)
df['topic_confidence'] = doc_topics.max(axis=1)
Tips and Best Practices
Parameter Selection
n_topics: Start with 5-10 topics for small datasets, 10-50 for larger ones
alpha: Higher values (1.0+) create more evenly distributed topics
beta: Keep small (0.01-0.1) for focused topics
max_iter: 100-200 usually sufficient for convergence
epsilon: Default (1e-10) works well; increase for extreme numerical stability, decrease for higher precision
Performance Optimization
Use random_state for reproducible results
Set max_iter lower for faster experimentation
Adjust vectorizer_params to control vocabulary size
For large datasets, consider increasing min_df to reduce vocabulary
Topic Quality
Check coherence scores - higher is generally better
Examine top words per topic for interpretability
Use get_document_topics() to see topic assignments
Compare different n_topics values using coherence
Common Issues
- Import Errors
Make sure Cython extensions are built: python setup.py build_ext --inplace
- Empty Topics
Reduce n_topics or adjust vectorizer_params (lower min_df)
- Poor Topic Quality
Try different alpha/beta values or increase max_iter
- Memory Issues
Increase min_df to reduce vocabulary size for large datasets
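The effect of min_df on vocabulary size can be sketched without any dependencies. The helper below is hypothetical and only mirrors the integer form of min_df, where a word is kept if it appears in at least that many documents.

```python
from collections import Counter
from typing import List


def filter_vocabulary(tokenized_docs: List[List[str]], min_df: int = 1) -> List[str]:
    """Keep words that occur in at least `min_df` documents."""
    document_frequency = Counter()
    for tokens in tokenized_docs:
        # Use a set so repeated words count once per document.
        document_frequency.update(set(tokens))
    return sorted(word for word, df in document_frequency.items() if df >= min_df)


tokenized_docs = [
    ["deep", "learning"],
    ["machine", "learning"],
    ["learning", "learning", "text"],
]

print(filter_vocabulary(tokenized_docs, min_df=1))
print(filter_vocabulary(tokenized_docs, min_df=2))
```

Raising min_df shrinks the vocabulary, which reduces memory use, but setting it too high on a small corpus can leave too few words to form meaningful topics, which is one cause of the empty topics mentioned above.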
Migration Guide
Converting from Original API
Old code:
# Original bitermplus API
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=600)
p_zd = model.transform(docs_vec)
New code:
# New sklearn-style API
model = btm.BTMClassifier(
    n_topics=8,
    random_state=42,
    coherence_window=20,
    alpha=50/8,
    beta=0.01,
    max_iter=600,
    epsilon=1e-10  # Numerical stability parameter
)
p_zd = model.fit_transform(texts)
The new API handles all preprocessing automatically while providing the same underlying functionality with much simpler usage.