Utility functions
- bitermplus.get_words_freqs(docs: List[str] | ndarray | Series, **kwargs: dict) → Tuple[csr_matrix, ndarray, Dict]
Extract word frequencies and vocabulary from text documents.
This function vectorizes a collection of text documents into a sparse matrix representation suitable for topic modeling. It uses scikit-learn’s CountVectorizer to tokenize, count, and filter words, creating a document-term matrix.
- Parameters:
docs (list of str, numpy.ndarray, or pandas.Series) – Collection of text documents to vectorize. Each element should be a string containing the text content of one document.
**kwargs (dict) –
Additional keyword arguments passed to CountVectorizer. Common options include:
min_df : int or float, minimum document frequency
max_df : int or float, maximum document frequency
stop_words : str or list, stop words to remove
lowercase : bool, whether to convert to lowercase
token_pattern : str, regex pattern for tokenization
- Returns:
doc_term_matrix (scipy.sparse.csr_matrix, shape (n_documents, n_features)) – Sparse matrix where element (i,j) represents the count of term j in document i.
vocabulary (numpy.ndarray, shape (n_features,)) – Array of feature names (words) corresponding to the matrix columns.
vocab_dict (dict) – Dictionary mapping terms to their column indices in the matrix.
Examples
Basic usage:
>>> import bitermplus as btm
>>> texts = ["machine learning is great", "I love natural language processing"]
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> print(f"Matrix shape: {X.shape}")
>>> print(f"Vocabulary size: {len(vocabulary)}")
With custom parameters:
>>> X, vocab, vocab_dict = btm.get_words_freqs(
...     texts, min_df=1, stop_words='english', lowercase=True
... )
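To make the relationship among the three return values concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `toy_words_freqs` is a hypothetical helper that returns dense lists instead of a `csr_matrix`, and it assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted features).

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default token pattern
# (tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def toy_words_freqs(docs):
    """Hypothetical stand-in for get_words_freqs, using dense rows."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # Column j of the matrix corresponds to vocabulary[j].
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    # vocab_dict is the inverse lookup: term -> column index.
    vocab_dict = {term: idx for idx, term in enumerate(vocabulary)}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows, vocabulary, vocab_dict


texts = ["machine learning is great", "I love natural language processing"]
rows, vocabulary, vocab_dict = toy_words_freqs(texts)
print(vocabulary)
# ['great', 'is', 'language', 'learning', 'love', 'machine', 'natural', 'processing']
print(rows[0])
# [1, 1, 0, 1, 0, 1, 0, 0]
```

Note that the single-character token "I" is dropped by the default pattern, so it never reaches the vocabulary.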
Notes
This function is primarily used internally by BTMClassifier, but can be useful for manual preprocessing when using the low-level BTM class directly.
See also
get_vectorized_docs – Convert documents to word ID representation
get_biterms – Generate biterms from vectorized documents
sklearn.feature_extraction.text.CountVectorizer – Underlying vectorization method
- bitermplus.get_vectorized_docs(docs: List[str] | ndarray, vocab: List[str] | ndarray) → List[ndarray]
Convert text documents to vectorized representation using word IDs.
This function transforms raw text documents into a numerical representation where each word is replaced by its corresponding index in the vocabulary. This is a preprocessing step required before biterm generation and BTM training.
- Parameters:
docs (list of str or numpy.ndarray) – Collection of text documents. Each document should be a string.
vocab (list of str or numpy.ndarray) – Vocabulary array containing all unique terms. Typically obtained from get_words_freqs() function.
- Returns:
vectorized_docs – List of vectorized documents. Each document is represented as a numpy array of word IDs (integers) corresponding to vocabulary indices. Words not in the vocabulary are filtered out.
- Return type:
list of numpy.ndarray
Examples
Basic usage:
>>> import bitermplus as btm
>>> texts = ["machine learning is great", "I love deep learning"]
>>> X, vocabulary, _ = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> print(f"Original: {texts[0]}")
>>> print(f"Vectorized: {docs_vec[0]}")
Complete preprocessing pipeline:
>>> texts = ["AI and ML are exciting", "Deep learning transforms data"]
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vectorized = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vectorized)
Notes
Documents are split on whitespace and filtered to include only known vocabulary
Empty strings and None values are handled gracefully
This function is automatically called by BTMClassifier but useful for manual preprocessing
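The behaviour described in these notes can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_vectorize` is a hypothetical helper, it returns lists rather than numpy arrays, it performs no case folding, and it treats both empty strings and None as empty documents.

```python
def toy_vectorize(docs, vocab):
    """Hypothetical sketch: map whitespace-separated tokens to vocabulary indices."""
    word_to_id = {word: idx for idx, word in enumerate(vocab)}
    vectorized = []
    for doc in docs:
        # Assumption: "" and None both become an empty document.
        tokens = doc.split() if doc else []
        # Tokens missing from the vocabulary are silently dropped.
        vectorized.append([word_to_id[t] for t in tokens if t in word_to_id])
    return vectorized


vocab = ["deep", "great", "is", "learning", "love", "machine"]
docs = ["machine learning is great", "I love deep learning", "", None]
ids = toy_vectorize(docs, vocab)
print(ids)
# [[5, 3, 2, 1], [4, 0, 3], [], []]
```

In the second document, "I" is not in the vocabulary, so only three IDs remain.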
See also
get_words_freqs – Extract vocabulary and document-term matrix
get_biterms – Generate biterms from vectorized documents
BTMClassifier – High-level interface that handles preprocessing automatically
- bitermplus.get_biterms(docs: List[ndarray], win: int = 15) → List[List[int]]
Generate biterms (word pairs) from vectorized documents.
Biterms are word co-occurrence pairs that capture local word associations within a specified window. This is the core data structure used by BTM to model topics in short texts. Unlike traditional topic models that work with individual documents, BTM aggregates biterms across the entire corpus.
- Parameters:
docs (list of numpy.ndarray) – List of vectorized documents where each document is a numpy array of word IDs. Typically obtained from get_vectorized_docs() function.
win (int, default=15) – Window size for biterm extraction. Biterms are created from all word pairs within this distance in each document. Larger windows capture more long-range dependencies but may introduce noise.
- Returns:
biterms – Nested list structure where biterms[i] contains all biterms for document i. Each biterm is represented as [word_id1, word_id2] where word_id1 <= word_id2.
- Return type:
list of list of list
- Raises:
ValueError – If no biterms can be generated from the input documents (e.g., all documents are too short or vocabulary overlap is insufficient).
Examples
Basic usage:
>>> import bitermplus as btm
>>> texts = ["machine learning algorithms", "deep learning networks"]
>>> X, vocabulary, _ = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec)
>>> print(f"Number of documents: {len(biterms)}")
>>> print(f"Biterms in first doc: {biterms[0]}")
With custom window size:
>>> biterms = btm.get_biterms(docs_vec, win=10)
Complete preprocessing pipeline:
>>> texts = ["AI and machine learning", "Natural language processing"]
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec, win=15)
>>> # Now ready for BTM training
>>> model = btm.BTM(X, vocabulary, T=2)
>>> model.fit(biterms)
Notes
Documents with fewer than 2 words produce no biterms and are skipped
Biterms are ordered such that the smaller word ID comes first
The function validates that at least some biterms are generated
Window size should be chosen based on document length and desired dependencies
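The windowing and ordering rules in these notes can be illustrated with a short pure-Python sketch. `toy_biterms` is a hypothetical helper, not the library's code, and it rests on two assumptions: "within this distance" is read as pairing position i with positions i+1 through i+win-1, and "skipped" is read as omitting the document from the output.

```python
def toy_biterms(docs, win=15):
    """Hypothetical sketch of windowed biterm extraction."""
    biterms = []
    for doc in docs:
        if len(doc) < 2:
            continue  # assumption: too-short documents are omitted entirely
        doc_biterms = []
        for i in range(len(doc) - 1):
            # Assumption: pair position i with positions i+1 .. i+win-1.
            for j in range(i + 1, min(i + win, len(doc))):
                a, b = doc[i], doc[j]
                doc_biterms.append([min(a, b), max(a, b)])  # smaller ID first
        biterms.append(doc_biterms)
    if not biterms:
        raise ValueError("no biterms could be generated")
    return biterms


docs = [[5, 3, 2, 1], [7], [4, 0, 3]]
all_pairs = toy_biterms(docs)
narrow = toy_biterms(docs, win=2)
print(all_pairs[0])
# [[3, 5], [2, 5], [1, 5], [2, 3], [1, 3], [1, 2]]
print(narrow)
# [[[3, 5], [2, 3], [1, 2]], [[0, 4], [0, 3]]]
```

With the default window every pair in these short documents is emitted, while win=2 under this reading keeps only adjacent words. The single-word document contributes nothing in either case.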
See also
get_vectorized_docs – Convert documents to word ID representation
BTM.fit – Fit BTM model using generated biterms
BTMClassifier – High-level interface that handles biterm generation automatically
- bitermplus.get_top_topic_words(model: BTM, words_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame
Select top topic words from a fitted model.
- Parameters:
model (bitermplus._btm.BTM) – Fitted BTM model.
words_num (int = 20) – The number of words to select.
topics_idx (Union[List, numpy.ndarray] = None) – Topics indices. Meant to be used to select only stable topics.
- Returns:
Words with the highest probabilities for each selected topic.
- Return type:
DataFrame
Example
>>> stable_topics = [0, 3, 10, 12, 18, 21]
>>> top_words = btm.get_top_topic_words(
...     model,
...     words_num=100,
...     topics_idx=stable_topics)
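As a rough illustration of the selection itself, the sketch below ranks words by probability within each requested topic. Everything here is hypothetical: `toy_top_words` is not part of bitermplus, the topics-vs-words matrix `p_wz` is made-up data laid out with one row per topic, and the result is a plain dict rather than a DataFrame.

```python
def toy_top_words(p_wz, vocabulary, words_num=20, topics_idx=None):
    """Hypothetical sketch: top words per topic from a topics-by-words matrix."""
    if topics_idx is None:
        topics_idx = range(len(p_wz))  # default: every topic
    top = {}
    for t in topics_idx:
        # Sort word indices by descending probability within topic t.
        order = sorted(range(len(vocabulary)), key=lambda w: p_wz[t][w], reverse=True)
        top[t] = [vocabulary[w] for w in order[:words_num]]
    return top


vocabulary = ["data", "deep", "learning", "model"]
p_wz = [
    [0.10, 0.20, 0.60, 0.10],  # topic 0
    [0.50, 0.10, 0.10, 0.30],  # topic 1
    [0.25, 0.25, 0.25, 0.25],  # topic 2 (not selected below)
]
top = toy_top_words(p_wz, vocabulary, words_num=2, topics_idx=[0, 1])
print(top)
# {0: ['learning', 'deep'], 1: ['data', 'model']}
```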
- bitermplus.get_top_topic_docs(docs: Sequence[Any], p_zd: ndarray, docs_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame
Select top topic docs from a fitted model.
- Parameters:
docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).
p_zd (np.ndarray) – Documents vs topics probabilities matrix.
docs_num (int = 20) – The number of documents to select.
topics_idx (Sequence[Any] = None) – Topics indices. Meant to be used to select only stable topics.
- Returns:
Documents with the highest probabilities in each selected topic.
- Return type:
DataFrame
Example
>>> top_docs = btm.get_top_topic_docs(
...     texts,
...     p_zd,
...     docs_num=100,
...     topics_idx=[1,2,3,4])
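The same idea applies column-wise to the documents-vs-topics matrix. The sketch below is a hypothetical illustration only: `toy_top_docs` is not a bitermplus function, `p_zd` is assumed to have one row per document and one column per topic, and the output is a dict instead of a DataFrame.

```python
def toy_top_docs(docs, p_zd, docs_num=20, topics_idx=None):
    """Hypothetical sketch: most probable documents for each topic."""
    if topics_idx is None:
        topics_idx = range(len(p_zd[0]))  # default: every topic column
    top = {}
    for t in topics_idx:
        # Sort document indices by descending probability of topic t.
        order = sorted(range(len(docs)), key=lambda d: p_zd[d][t], reverse=True)
        top[t] = [docs[d] for d in order[:docs_num]]
    return top


docs = ["doc a", "doc b", "doc c"]
p_zd = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
top_docs = toy_top_docs(docs, p_zd, docs_num=2)
print(top_docs)
# {0: ['doc a', 'doc c'], 1: ['doc b', 'doc c']}
```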
- bitermplus.get_docs_top_topic(docs: Sequence[Any], p_zd: ndarray) → DataFrame
Select most probable topic for each document.
- Parameters:
docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).
p_zd (np.ndarray) – Documents vs topics probabilities matrix.
- Returns:
Documents and the most probable topic for each of them.
- Return type:
DataFrame
Example
>>> import bitermplus as btm
>>> # Read documents from file
>>> # texts = ...
>>> # Build and train a model
>>> # model = ...
>>> # model.fit(...)
>>> btm.get_docs_top_topic(texts, model.matrix_docs_topics_)
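Conceptually, picking the most probable topic is a row-wise argmax over the documents-vs-topics matrix. The following sketch shows that step on made-up data. `toy_docs_top_topic` is a hypothetical helper that returns (document, topic index) tuples rather than a DataFrame, and it assumes one row per document.

```python
def toy_docs_top_topic(docs, p_zd):
    """Hypothetical sketch: pair each document with its most probable topic."""
    result = []
    for doc, probs in zip(docs, p_zd):
        # Index of the largest probability in this document's row.
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((doc, best))
    return result


docs = ["doc a", "doc b", "doc c"]
p_zd = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
assignments = toy_docs_top_topic(docs, p_zd)
print(assignments)
# [('doc a', 0), ('doc b', 1), ('doc c', 0)]
```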