Utility functions

bitermplus.get_words_freqs(docs: List[str] | ndarray | Series, **kwargs: dict) → Tuple[csr_matrix, ndarray, Dict]

Extract word frequencies and vocabulary from text documents.

This function vectorizes a collection of text documents into a sparse matrix representation suitable for topic modeling. It uses scikit-learn’s CountVectorizer to tokenize, count, and filter words, creating a document-term matrix.

Parameters:
  • docs (list of str, numpy.ndarray, or pandas.Series) – Collection of text documents to vectorize. Each element should be a string containing the text content of one document.

  • **kwargs (dict) –

    Additional keyword arguments passed to CountVectorizer. Common options include:

    • min_df : int or float, minimum document frequency

    • max_df : int or float, maximum document frequency

    • stop_words : str or list, stop words to remove

    • lowercase : bool, whether to convert to lowercase

    • token_pattern : str, regex pattern for tokenization

Returns:

  • doc_term_matrix (scipy.sparse.csr_matrix, shape (n_documents, n_features)) – Sparse matrix where element (i,j) represents the count of term j in document i.

  • vocabulary (numpy.ndarray, shape (n_features,)) – Array of feature names (words) corresponding to the matrix columns.

  • vocab_dict (dict) – Dictionary mapping terms to their column indices in the matrix.
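To make the relationship between the three return values concrete, here is a minimal pure-Python sketch. It is not the library's implementation (which delegates to CountVectorizer and returns a sparse matrix). The helper name `toy_words_freqs` is hypothetical, the matrix is a dense list of lists, and tokenization is simplified to lowercasing plus whitespace splitting.

```python
from collections import Counter


def toy_words_freqs(docs):
    """Hypothetical helper: dense analogue of (doc_term_matrix, vocabulary, vocab_dict)."""
    # Simplified tokenization (assumption): lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: sorted unique terms; a term's position is its column index.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vocab_dict = {term: idx for idx, term in enumerate(vocabulary)}
    # Dense stand-in for the sparse matrix: one row per document, one column per term.
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts.get(term, 0) for term in vocabulary])
    return matrix, vocabulary, vocab_dict


texts = ["machine learning is great", "deep learning is deep"]
X, vocabulary, vocab_dict = toy_words_freqs(texts)
print(vocabulary)                # ['deep', 'great', 'is', 'learning', 'machine']
print(X[1][vocab_dict["deep"]])  # 2: "deep" occurs twice in document 1
```

Element (i, j) of the matrix is the count of `vocabulary[j]` in document i, and `vocab_dict` gives the column index j for a given term.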

Examples

Basic usage:

>>> import bitermplus as btm
>>> texts = ["machine learning is great", "I love natural language processing"]
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> print(f"Matrix shape: {X.shape}")
>>> print(f"Vocabulary size: {len(vocabulary)}")

With custom parameters:

>>> X, vocab, vocab_dict = btm.get_words_freqs(
...     texts, min_df=1, stop_words='english', lowercase=True
... )

Notes

This function is primarily used internally by BTMClassifier, but can be useful for manual preprocessing when using the low-level BTM class directly.

See also

get_vectorized_docs

Convert documents to word ID representation

get_biterms

Generate biterms from vectorized documents

sklearn.feature_extraction.text.CountVectorizer

Underlying vectorization method

bitermplus.get_vectorized_docs(docs: List[str] | ndarray, vocab: List[str] | ndarray) → List[ndarray]

Convert text documents to vectorized representation using word IDs.

This function transforms raw text documents into a numerical representation where each word is replaced by its corresponding index in the vocabulary. This is a preprocessing step required before biterm generation and BTM training.

Parameters:
  • docs (list of str or numpy.ndarray) – Collection of text documents. Each document should be a string.

  • vocab (list of str or numpy.ndarray) – Vocabulary array containing all unique terms. Typically obtained from the get_words_freqs() function.

Returns:

vectorized_docs – List of vectorized documents. Each document is represented as a numpy array of word IDs (integers) corresponding to vocabulary indices. Words not in the vocabulary are filtered out.

Return type:

list of numpy.ndarray

Examples

Basic usage:

>>> import bitermplus as btm
>>> texts = ["machine learning is great", "I love deep learning"]
>>> X, vocabulary, _ = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> print(f"Original: {texts[0]}")
>>> print(f"Vectorized: {docs_vec[0]}")

Complete preprocessing pipeline:

>>> texts = ["AI and ML are exciting", "Deep learning transforms data"]
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vectorized = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vectorized)

Notes

  • Documents are split on whitespace and filtered to include only known vocabulary

  • Empty strings and None values are handled gracefully

  • This function is called automatically by BTMClassifier but is also useful for manual preprocessing
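The word-to-ID mapping described in these notes can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `toy_vectorize` is a hypothetical name, it returns plain lists rather than numpy arrays, and it does not model how None values are handled.

```python
def toy_vectorize(docs, vocab):
    """Hypothetical helper: replace each known word with its vocabulary index."""
    word_to_id = {word: idx for idx, word in enumerate(vocab)}
    vectorized = []
    for doc in docs:
        # Split on whitespace; words missing from the vocabulary are dropped.
        ids = [word_to_id[w] for w in doc.split() if w in word_to_id]
        vectorized.append(ids)
    return vectorized


vocab = ["deep", "great", "learning", "machine"]
docs = ["machine learning is great", "", "deep learning"]
docs_vec = toy_vectorize(docs, vocab)
print(docs_vec)  # [[3, 2, 1], [], [0, 2]]
```

Here "is" is absent from the vocabulary and is filtered out, and the empty string yields an empty list of IDs.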

See also

get_words_freqs

Extract vocabulary and document-term matrix

get_biterms

Generate biterms from vectorized documents

BTMClassifier

High-level interface that handles preprocessing automatically

bitermplus.get_biterms(docs: List[ndarray], win: int = 15) → List[List[int]]

Generate biterms (word pairs) from vectorized documents.

Biterms are word co-occurrence pairs that capture local word associations within a specified window. This is the core data structure used by BTM to model topics in short texts. Unlike traditional topic models that work with individual documents, BTM aggregates biterms across the entire corpus.

Parameters:
  • docs (list of numpy.ndarray) – List of vectorized documents where each document is a numpy array of word IDs. Typically obtained from the get_vectorized_docs() function.

  • win (int, default=15) – Window size for biterm extraction. Biterms are created from all word pairs within this distance in each document. Larger windows capture more long-range dependencies but may introduce noise.

Returns:

biterms – Nested list structure where biterms[i] contains all biterms for document i. Each biterm is represented as [word_id1, word_id2] where word_id1 <= word_id2.

Return type:

list of list of list of int

Raises:

ValueError – If no biterms can be generated from the input documents (e.g., all documents are too short or vocabulary overlap is insufficient).

Examples

Basic usage:

>>> import bitermplus as btm
>>> texts = ["machine learning algorithms", "deep learning networks"]
>>> X, vocabulary, _ = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec)
>>> print(f"Number of documents: {len(biterms)}")
>>> print(f"Biterms in first doc: {biterms[0]}")

With custom window size:

>>> biterms = btm.get_biterms(docs_vec, win=10)

Complete preprocessing pipeline:

>>> texts = ["AI and machine learning", "Natural language processing"]
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec, win=15)
>>> # Now ready for BTM training
>>> model = btm.BTM(X, vocabulary, T=2)
>>> model.fit(biterms)

Notes

  • Documents with fewer than 2 words produce no biterms and are skipped

  • Biterms are ordered such that the smaller word ID comes first

  • The function validates that at least some biterms are generated

  • Window size should be chosen based on document length and desired dependencies
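The behaviour listed in these notes can be sketched as follows. This is a minimal illustration, not the library's implementation: `toy_biterms` is a hypothetical name, and the exact window boundary (pairing position i with positions i+1 up to, but not including, i+win) is an assumption.

```python
def toy_biterms(docs, win=15):
    """Hypothetical helper: windowed word-pair extraction from word-ID lists."""
    biterms = []
    for doc in docs:
        # Documents with fewer than 2 words cannot form a pair and are skipped.
        if len(doc) < 2:
            continue
        doc_biterms = []
        for i in range(len(doc) - 1):
            # Assumed window convention: partner positions i+1 .. i+win-1.
            for j in range(i + 1, min(i + win, len(doc))):
                a, b = doc[i], doc[j]
                # Store the smaller word ID first.
                doc_biterms.append([min(a, b), max(a, b)])
        biterms.append(doc_biterms)
    # Fail loudly when nothing could be generated.
    if not any(biterms):
        raise ValueError("No biterms could be generated from the documents")
    return biterms


docs_vec = [[3, 2, 1], [0], [0, 2]]
print(toy_biterms(docs_vec))         # [[[2, 3], [1, 3], [1, 2]], [[0, 2]]]
print(toy_biterms(docs_vec, win=2))  # [[[2, 3], [1, 2]], [[0, 2]]]
```

With the smaller window, the pair formed by the first and third words of the first document is no longer generated, which shows how win limits the pairs to nearby words.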

See also

get_vectorized_docs

Convert documents to word ID representation

BTM.fit

Fit BTM model using generated biterms

BTMClassifier

High-level interface that handles biterm generation automatically

bitermplus.get_top_topic_words(model: BTM, words_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame

Select top topic words from a fitted model.

Parameters:
  • model (bitermplus._btm.BTM) – Fitted BTM model.

  • words_num (int, default=20) – The number of words to select.

  • topics_idx (list or numpy.ndarray, default=None) – Indices of the topics to include. Intended for restricting the output to stable topics.

Returns:

The highest-probability words for each selected topic.

Return type:

DataFrame
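As an illustration of what "top words per topic" means, the selection can be sketched in plain Python. This is a sketch under assumptions, not the library's code: `toy_top_topic_words` is a hypothetical name, the input is assumed to be a topics-by-words table of probabilities, and the result is a dict rather than a DataFrame.

```python
def toy_top_topic_words(topic_word_probs, vocabulary, words_num=20, topics_idx=None):
    """Hypothetical helper: rank words by probability within each selected topic."""
    if topics_idx is None:
        # Assumption for this sketch: no selection means all topics.
        topics_idx = range(len(topic_word_probs))
    top = {}
    for t in topics_idx:
        row = topic_word_probs[t]
        # Word indices sorted by descending probability within topic t.
        ranked = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
        top[t] = [vocabulary[w] for w in ranked[:words_num]]
    return top


vocabulary = ["deep", "great", "learning", "machine"]
topic_word_probs = [
    [0.1, 0.2, 0.3, 0.4],  # topic 0
    [0.5, 0.1, 0.3, 0.1],  # topic 1
]
print(toy_top_topic_words(topic_word_probs, vocabulary, words_num=2))
# {0: ['machine', 'learning'], 1: ['deep', 'learning']}
```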

Example

>>> import bitermplus as btm
>>> # model is a fitted btm.BTM instance
>>> stable_topics = [0, 3, 10, 12, 18, 21]
>>> top_words = btm.get_top_topic_words(
...     model,
...     words_num=100,
...     topics_idx=stable_topics)

bitermplus.get_top_topic_docs(docs: Sequence[Any], p_zd: ndarray, docs_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame

Select top topic docs from a fitted model.

Parameters:
  • docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).

  • p_zd (np.ndarray) – Matrix of document-topic probabilities (documents vs. topics).

  • docs_num (int, default=20) – The number of documents to select.

  • topics_idx (Sequence[Any], default=None) – Indices of the topics to include. Intended for restricting the output to stable topics.

Returns:

The documents with the highest probabilities for each selected topic.

Return type:

DataFrame
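The same ranking idea applies to documents. The sketch below is illustrative only: `toy_top_topic_docs` is a hypothetical name, `p_zd` is assumed to have one row per document and one column per topic, and the result is a dict rather than a DataFrame.

```python
def toy_top_topic_docs(docs, p_zd, docs_num=20, topics_idx=None):
    """Hypothetical helper: rank documents by their probability under each topic."""
    if topics_idx is None:
        # Assumption for this sketch: no selection means all topics.
        topics_idx = range(len(p_zd[0]))
    top = {}
    for t in topics_idx:
        # Document indices sorted by descending probability of topic t.
        ranked = sorted(range(len(docs)), key=lambda d: p_zd[d][t], reverse=True)
        top[t] = [docs[d] for d in ranked[:docs_num]]
    return top


docs = ["doc a", "doc b", "doc c"]
p_zd = [
    [0.9, 0.1],  # doc a
    [0.2, 0.8],  # doc b
    [0.6, 0.4],  # doc c
]
print(toy_top_topic_docs(docs, p_zd, docs_num=2))
# {0: ['doc a', 'doc c'], 1: ['doc b', 'doc c']}
```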

Example

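>>> import bitermplus as btm
>>> # texts: the original documents; p_zd: documents vs topics probabilities
>>> top_docs = btm.get_top_topic_docs(
...     texts,
...     p_zd,
...     docs_num=100,
...     topics_idx=[1, 2, 3, 4])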

bitermplus.get_docs_top_topic(docs: Sequence[Any], p_zd: ndarray) → DataFrame

Select most probable topic for each document.

Parameters:
  • docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).

  • p_zd (np.ndarray) – Matrix of document-topic probabilities (documents vs. topics), e.g. model.matrix_docs_topics_.

Returns:

Documents and the most probable topic for each of them.

Return type:

DataFrame
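Picking the most probable topic amounts to a per-document argmax over the topic probabilities. A minimal sketch, assuming `p_zd` has one row per document and one column per topic; `toy_docs_top_topic` is a hypothetical name and returns a list of tuples rather than a DataFrame.

```python
def toy_docs_top_topic(docs, p_zd):
    """Hypothetical helper: pair each document with its most probable topic index."""
    result = []
    for doc, probs in zip(docs, p_zd):
        # Index of the largest probability in this document's row.
        best_topic = max(range(len(probs)), key=lambda t: probs[t])
        result.append((doc, best_topic))
    return result


docs = ["doc a", "doc b", "doc c"]
p_zd = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
print(toy_docs_top_topic(docs, p_zd))
# [('doc a', 0), ('doc b', 1), ('doc c', 0)]
```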

Example

>>> import bitermplus as btm
>>> # Read documents from file
>>> # texts = ...
>>> # Build and train a model
>>> # model = ...
>>> # model.fit(...)
>>> btm.get_docs_top_topic(texts, model.matrix_docs_topics_)