Sklearn-style API
=================

The bitermplus package now includes a scikit-learn-compatible API that wraps vectorization, biterm generation, and model fitting behind a single estimator. The :class:`BTMClassifier` class provides a familiar interface for scikit-learn users and can be used inside ML pipelines.

.. contents::
   :local:
   :depth: 2

Quick Start
-----------

The new API reduces a multi-step topic modeling workflow to a few lines:

.. code-block:: python

    import bitermplus as btm

    # Sample documents
    texts = [
        "machine learning algorithms are powerful",
        "deep learning neural networks process data",
        "natural language processing understands text",
        "artificial intelligence transforms industries"
    ]

    # Create and fit model (one step!)
    model = btm.BTMClassifier(n_topics=2, random_state=42)
    model.fit(texts)

    # Get topic distributions
    doc_topics = model.transform(texts)
    print(f"Document-topic matrix shape: {doc_topics.shape}")

    # Interpret topics
    topic_words = model.get_topic_words(n_words=5)
    for topic_id, words in topic_words.items():
        print(f"Topic {topic_id}: {', '.join(words)}")

API Comparison
--------------

**Traditional API** (complex, multi-step)::

    # Multiple manual preprocessing steps
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    biterms = btm.get_biterms(docs_vec)

    # Model creation and fitting
    model = btm.BTM(X, vocabulary, seed=42, T=3, M=20, alpha=50/3, beta=0.01)
    model.fit(biterms, iterations=100)

    # Inference
    p_zd = model.transform(docs_vec)

**New Sklearn API** (simple, one step)::

    # Everything in one step!
    model = btm.BTMClassifier(n_topics=3, random_state=42)
    doc_topics = model.fit_transform(texts)

BTMClassifier Class
-------------------

.. currentmodule:: bitermplus

.. autoclass:: BTMClassifier
   :members:
   :inherited-members:
   :show-inheritance:

   .. automethod:: __init__

Core Methods
~~~~~~~~~~~~

The :class:`BTMClassifier` follows the sklearn estimator interface:

**fit(X, y=None)**
    Train the BTM model on documents.

**transform(X, infer_type='sum_b')**
    Transform documents to topic probability distributions.

**fit_transform(X, y=None, infer_type='sum_b')**
    Fit model and transform documents in one step.

**score(X, y=None)**
    Return mean coherence score across topics.

Parameters
~~~~~~~~~~

**n_topics** : int, default=8
    Number of topics to extract.

**alpha** : float, default=None
    Dirichlet prior for topic distribution. If None, uses 50/n_topics.

**beta** : float, default=0.01
    Dirichlet prior for word distribution.

**max_iter** : int, default=600
    Maximum iterations for model training.

**random_state** : int, default=None
    Random seed for reproducible results.

**window_size** : int, default=15
    Window size for biterm generation.

**vectorizer_params** : dict, default=None
    Parameters for the internal CountVectorizer.

**epsilon** : float, default=1e-10
    Small numerical constant to prevent division by zero and improve numerical stability.

Topic Analysis Methods
~~~~~~~~~~~~~~~~~~~~~~

**get_topic_words(topic_id=None, n_words=10)**
    Get top words for topics. Returns a list for a single topic or a dict for all topics.

**get_document_topics(X, threshold=0.1)**
    Get dominant topics for documents above a probability threshold.

Properties
~~~~~~~~~~

**coherence_** : np.ndarray
    Topic coherence scores.

**perplexity_** : float
    Model perplexity (requires transform to be called first).

**topic_word_matrix_** : np.ndarray
    Topics × words probability matrix.

**vocabulary_** : np.ndarray
    Learned vocabulary.

**n_features_in_** : int
    Number of features (vocabulary size).

Sklearn Integration
-------------------

Cross-validation
~~~~~~~~~~~~~~~~

.. code-block:: python

    from sklearn.model_selection import cross_val_score

    model = btm.BTMClassifier(n_topics=5, random_state=42)
    scores = cross_val_score(model, texts, cv=3)
    print(f"Mean coherence: {scores.mean():.3f}")

Pipeline Integration
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer

    def preprocess_text(texts):
        return [text.lower().replace(',', '') for text in texts]

    pipeline = Pipeline([
        ('preprocess', FunctionTransformer(preprocess_text)),
        ('btm', btm.BTMClassifier(n_topics=3, random_state=42))
    ])

    doc_topics = pipeline.fit_transform(texts)

Grid Search
~~~~~~~~~~~

.. code-block:: python

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_topics': [3, 5, 8],
        'alpha': [0.1, 0.5, 1.0],
        'max_iter': [100, 300]
    }

    grid_search = GridSearchCV(
        btm.BTMClassifier(random_state=42),
        param_grid,
        cv=3,
        scoring=None  # Uses model's score method
    )
    grid_search.fit(texts)
    best_model = grid_search.best_estimator_

Advanced Usage
--------------

Custom Preprocessing
~~~~~~~~~~~~~~~~~~~~

Control text preprocessing with ``vectorizer_params``:

.. code-block:: python

    custom_params = {
        'min_df': 2,                          # Minimum document frequency
        'max_df': 0.8,                        # Maximum document frequency
        'stop_words': 'english',              # Remove English stop words
        'lowercase': True,                    # Convert to lowercase
        'token_pattern': r'\b[a-zA-Z]{3,}\b'  # Only words 3+ chars
    }

    model = btm.BTMClassifier(
        n_topics=5,
        vectorizer_params=custom_params
    )

Inference Types
~~~~~~~~~~~~~~~

Choose different inference methods:

.. code-block:: python

    model = btm.BTMClassifier(n_topics=5)
    model.fit(texts)

    # Unseen documents to infer topics for (example data)
    new_texts = ["neural networks learn representations from data"]

    # Different inference types
    topics_sum_b = model.transform(new_texts, infer_type='sum_b')  # Default
    topics_sum_w = model.transform(new_texts, infer_type='sum_w')  # Word-based
    topics_mix = model.transform(new_texts, infer_type='mix')      # Mixed

Model Evaluation
~~~~~~~~~~~~~~~~

.. code-block:: python

    model = btm.BTMClassifier(n_topics=5, random_state=42)
    model.fit(texts)

    # Coherence per topic
    coherence_scores = model.coherence_
    print(f"Topic coherence: {coherence_scores}")

    # Overall model quality
    mean_coherence = model.score(texts)
    print(f"Mean coherence: {mean_coherence:.3f}")

    # Perplexity (lower is better)
    model.transform(texts)  # Required for perplexity calculation
    perplexity = model.perplexity_
    print(f"Perplexity: {perplexity:.3f}")

Working with Pandas
~~~~~~~~~~~~~~~~~~~

The API accepts a pandas Series of documents directly:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({'text': texts, 'category': ['ML', 'DL', 'NLP', 'AI']})

    model = btm.BTMClassifier(n_topics=3)
    doc_topics = model.fit_transform(df['text'])

    # Add topic predictions to DataFrame
    df['dominant_topic'] = doc_topics.argmax(axis=1)
    df['topic_confidence'] = doc_topics.max(axis=1)

Tips and Best Practices
-----------------------

Parameter Selection
~~~~~~~~~~~~~~~~~~~

- **n_topics**: Start with 5-10 topics for small datasets, 10-50 for larger ones
- **alpha**: Higher values (1.0+) create more evenly distributed topics
- **beta**: Keep small (0.01-0.1) for focused topics
- **max_iter**: 100-200 usually sufficient for convergence
- **epsilon**: Default (1e-10) works well; increase for extreme numerical stability, decrease for higher precision

Performance Optimization
~~~~~~~~~~~~~~~~~~~~~~~~

- Use ``random_state`` for reproducible results
- Set ``max_iter`` lower for faster experimentation
- Adjust ``vectorizer_params`` to control vocabulary size
- For large datasets, consider increasing ``min_df`` to reduce vocabulary

Topic Quality
~~~~~~~~~~~~~

- Check coherence scores - higher is generally better
- Examine top words per topic for interpretability
- Use ``get_document_topics()`` to see topic assignments
- Compare different ``n_topics`` values using coherence

Common Issues
-------------

**Import Errors**
    Make sure Cython extensions are built: ``python setup.py build_ext --inplace``

**Empty Topics**
    Reduce ``n_topics`` or adjust ``vectorizer_params`` (lower ``min_df``)

**Poor Topic Quality**
    Try different ``alpha``/``beta`` values or increase ``max_iter``

**Memory Issues**
    Increase ``min_df`` to reduce vocabulary size for large datasets

Migration Guide
---------------

Converting from Original API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Old code:**

.. code-block:: python

    # Original bitermplus API
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    biterms = btm.get_biterms(docs_vec)

    model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
    model.fit(biterms, iterations=600)
    p_zd = model.transform(docs_vec)

**New code:**

.. code-block:: python

    # New sklearn-style API
    model = btm.BTMClassifier(
        n_topics=8,
        random_state=42,
        coherence_window=20,
        alpha=50/8,
        beta=0.01,
        max_iter=600,
        epsilon=1e-10  # Numerical stability parameter
    )
    p_zd = model.fit_transform(texts)

The new API handles all preprocessing automatically while providing the same underlying functionality with much simpler usage.
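
To make the preprocessing that the new API performs automatically more concrete, the sketch below shows one plausible way to generate biterms (unordered word pairs that co-occur in a document) within a sliding window, using plain Python. It is an illustration, not the library's implementation: ``extract_biterms`` is a hypothetical helper, and the windowing convention (pairing tokens whose positions are fewer than ``window_size`` apart) is an assumption.

.. code-block:: python

    def extract_biterms(tokens, window_size=15):
        """Return unordered token pairs whose positions are fewer than
        ``window_size`` apart (hypothetical helper, for illustration only)."""
        biterms = []
        for i in range(len(tokens) - 1):
            # Pair token i with the tokens that follow it inside the window
            for j in range(i + 1, min(i + window_size, len(tokens))):
                # Sort each pair so that (a, b) and (b, a) are the same biterm
                biterms.append(tuple(sorted((tokens[i], tokens[j]))))
        return biterms

    doc = "deep learning neural networks".split()
    print(extract_biterms(doc))                 # every pair in this short document
    print(extract_biterms(doc, window_size=2))  # adjacent words only

Under this convention a four-token document with the default window yields all six possible pairs, while a window of 2 keeps only adjacent words, so a smaller window reduces the number of biterms generated for longer documents.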