Utility functions

bitermplus.get_words_freqs(docs: List[str] | ndarray | Series, **kwargs: dict) → Tuple[csr_matrix, ndarray, Dict]

Compute the documents vs words frequency matrix.

Parameters:

docs (Union[List[str], np.ndarray, Series]) – Documents in any format accepted by sklearn.feature_extraction.text.CountVectorizer.
kwargs (dict) – Keyword arguments passed to sklearn.feature_extraction.text.CountVectorizer.

Returns:

Documents vs words matrix in CSR format, vocabulary as a numpy.ndarray of terms, and vocabulary as a dictionary of {term: id} pairs.

Return type:

Tuple[scipy.sparse.csr_matrix, np.ndarray, Dict]

Example

>>> import pandas as pd
>>> import bitermplus as btm

>>> # Loading data
>>> df = pd.read_csv(
...     'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
>>> texts = df['texts'].str.strip().tolist()

>>> # Computing the documents vs words matrix and obtaining the vocabulary
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
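
To make the relationship between the three return values concrete, here is a minimal plain-Python sketch on a toy corpus. It is an illustration only, not the library's implementation: the whitespace tokenization, the sorted vocabulary order, and the dense list-of-lists matrix are assumptions made for readability (the function itself returns a sparse CSR matrix and relies on CountVectorizer for tokenization).

```python
# Toy corpus (hypothetical data for illustration).
docs = ["apple banana apple", "banana cherry"]

# Stand-in for the vocabulary array: one entry per distinct term.
# Assumption: whitespace tokenization and alphabetical ordering.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Stand-in for the {term: id} dictionary: each term maps to its column index.
vocab_dict = {term: idx for idx, term in enumerate(vocabulary)}

# Documents vs words counts: one row per document, one column per term.
# Dense here for readability; the real return value is a sparse CSR matrix.
X = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)  # ['apple', 'banana', 'cherry']
print(vocab_dict)  # {'apple': 0, 'banana': 1, 'cherry': 2}
print(X)           # [[2, 1, 0], [0, 1, 1]]
```

Row i of X describes document i, and column vocab_dict[term] holds the count of that term.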

bitermplus.get_vectorized_docs(docs: List[str] | ndarray, vocab: List[str] | ndarray) → List[ndarray]

Replace words with their ids in each document.

Parameters:

docs (Union[List[str], np.ndarray]) – Documents (iterable of strings).
vocab (Union[List[str], np.ndarray]) – Vocabulary (iterable of terms).

Returns:

docs – Vectorized documents (list of numpy.ndarray objects with term ids).

Return type:

List[np.ndarray]

Example

>>> import pandas as pd
>>> import bitermplus as btm

>>> # Loading data
>>> df = pd.read_csv(
...     'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
>>> texts = df['texts'].str.strip().tolist()

>>> # Obtaining the full vocabulary and vectorizing documents
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
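
The idea of replacing words with their ids can be sketched in plain Python as follows. The helper name vectorize_docs is hypothetical, and two behaviors are assumptions rather than documented facts: documents are split on whitespace, and words missing from the vocabulary are skipped. Plain lists are used in place of numpy.ndarray objects.

```python
def vectorize_docs(docs, vocab):
    """Hypothetical helper: map each word to its position in vocab."""
    word_to_id = {word: idx for idx, word in enumerate(vocab)}
    # Assumption: out-of-vocabulary words are dropped rather than raising.
    return [
        [word_to_id[word] for word in doc.split() if word in word_to_id]
        for doc in docs
    ]


vocab = ["apple", "banana", "cherry"]
docs = ["banana apple banana", "cherry durian"]

docs_vec = vectorize_docs(docs, vocab)
print(docs_vec)  # [[1, 0, 1], [2]]
```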

bitermplus.get_biterms(docs: List[ndarray], win: int = 15) → List[List[int]]

Create biterms for each document.

Parameters:

docs (List[np.ndarray]) – List of numpy.ndarray objects containing word indices.
win (int = 15) – Window size for biterm generation.

Returns:

List of biterms for each document.

Return type:

List[List[int]]

Example

>>> import pandas as pd
>>> import bitermplus as btm

>>> # Loading data
>>> df = pd.read_csv(
...     'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
>>> texts = df['texts'].str.strip().tolist()

>>> # Vectorizing documents, obtaining full vocabulary and biterms
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec)
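
As a rough illustration of what a generation window means, the sketch below builds word-id pairs for a single vectorized document. The helper make_biterms is hypothetical, and the exact window semantics are an assumption (here each word is paired with up to win - 1 words that follow it); the library's own routine may define the boundaries differently.

```python
def make_biterms(doc_ids, win=15):
    """Hypothetical helper: pair each word id with ids that follow it in the window."""
    biterms = []
    for i in range(len(doc_ids) - 1):
        # Assumption: position j must satisfy i < j < i + win.
        for j in range(i + 1, min(i + win, len(doc_ids))):
            # Store each pair in sorted order so (a, b) and (b, a) coincide.
            biterms.append(sorted((doc_ids[i], doc_ids[j])))
    return biterms


doc = [3, 1, 2, 1]  # toy vectorized document (word ids)

print(make_biterms(doc, win=3))
# [[1, 3], [2, 3], [1, 2], [1, 1], [1, 2]]
print(len(make_biterms(doc)))  # 6 pairs when the window covers the whole document
```

A smaller window yields fewer pairs because distant words are no longer combined, which matters mostly for longer documents.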

bitermplus.get_top_topic_words(model: BTM, words_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame

Select top topic words from a fitted model.

Parameters:

model (bitermplus._btm.BTM) – Fitted BTM model.
words_num (int = 20) – The number of words to select.
topics_idx (Union[List, numpy.ndarray] = None) – Topics indices. Meant to be used to select only stable topics.

Returns:

Words with the highest probabilities for each selected topic.

Return type:

DataFrame

Example

>>> import bitermplus as btm
>>> # Build and train a model
>>> # model = ...
>>> stable_topics = [0, 3, 10, 12, 18, 21]
>>> top_words = btm.get_top_topic_words(
...     model,
...     words_num=100,
...     topics_idx=stable_topics)
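
The selection logic can be pictured with a small plain-Python sketch. Everything here is hypothetical: the helper top_words_per_topic, the toy topics vs words probability matrix p_wz, and the dictionary result (the function itself takes a fitted model and returns a DataFrame).

```python
def top_words_per_topic(p_wz, vocabulary, words_num=20, topics_idx=None):
    """Hypothetical helper: rank terms by probability within each selected topic."""
    if topics_idx is None:
        topics_idx = range(len(p_wz))  # default: all topics
    result = {}
    for t in topics_idx:
        # Sort word indices by descending probability in topic t.
        ranked = sorted(range(len(vocabulary)), key=lambda w: p_wz[t][w], reverse=True)
        result[t] = [vocabulary[w] for w in ranked[:words_num]]
    return result


vocabulary = ["apple", "banana", "cherry"]
# Toy matrix: rows are topics, columns are words (assumed layout).
p_wz = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

print(top_words_per_topic(p_wz, vocabulary, words_num=2))
# {0: ['apple', 'banana'], 1: ['cherry', 'banana']}
print(top_words_per_topic(p_wz, vocabulary, words_num=1, topics_idx=[1]))
# {1: ['cherry']}
```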

bitermplus.get_top_topic_docs(docs: Sequence[Any], p_zd: ndarray, docs_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame

Select top topic docs from a fitted model.

Parameters:

docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).
p_zd (np.ndarray) – Documents vs topics probabilities matrix.
docs_num (int = 20) – The number of documents to select.
topics_idx (Sequence[Any] = None) – Topics indices. Meant to be used to select only stable topics.

Returns:

Documents with the highest probabilities for each selected topic.

Return type:

DataFrame

Example

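>>> import bitermplus as btm
>>> # texts: documents; p_zd: documents vs topics probabilities matrix
>>> # p_zd = model.matrix_docs_topics_
>>> top_docs = btm.get_top_topic_docs(
...     texts,
...     p_zd,
...     docs_num=100,
...     topics_idx=[1, 2, 3, 4])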
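
A minimal sketch of the ranking involved, assuming p_zd has one row per document and one column per topic. The helper top_docs_per_topic and the toy data are hypothetical, and a plain dictionary stands in for the returned DataFrame.

```python
def top_docs_per_topic(docs, p_zd, docs_num=20, topics_idx=None):
    """Hypothetical helper: rank documents by probability within each selected topic."""
    if topics_idx is None:
        topics_idx = range(len(p_zd[0]))  # default: all topics
    result = {}
    for t in topics_idx:
        # Sort document indices by descending probability of topic t.
        ranked = sorted(range(len(docs)), key=lambda d: p_zd[d][t], reverse=True)
        result[t] = [docs[d] for d in ranked[:docs_num]]
    return result


docs = ["fruit salad recipe", "stock market news", "market prices today"]
# Toy documents vs topics matrix: rows are documents, columns are topics.
p_zd = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.2, 0.8],
]

print(top_docs_per_topic(docs, p_zd, docs_num=2))
# {0: ['fruit salad recipe', 'stock market news'],
#  1: ['market prices today', 'stock market news']}
```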

bitermplus.get_docs_top_topic(docs: Sequence[Any], p_zd: ndarray) → DataFrame

Select most probable topic for each document.

Parameters:

docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).
p_zd (np.ndarray) – Documents vs topics probabilities matrix.

Returns:

Documents and the most probable topic for each of them.

Return type:

DataFrame

Example

>>> import bitermplus as btm
>>> # Read documents from file
>>> # texts = ...
>>> # Build and train a model
>>> # model = ...
>>> # model.fit(...)
>>> btm.get_docs_top_topic(texts, model.matrix_docs_topics_)
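
Conceptually, picking the most probable topic is a row-wise argmax over the documents vs topics matrix. The sketch below shows that idea in plain Python; the helper docs_top_topic and the toy data are hypothetical, ties are resolved in favor of the lowest topic index (an assumption), and a list of tuples stands in for the returned DataFrame.

```python
def docs_top_topic(docs, p_zd):
    """Hypothetical helper: pair each document with the index of its most probable topic."""
    return [
        # Index of the largest probability in the row (first index wins on ties).
        (doc, max(range(len(row)), key=row.__getitem__))
        for doc, row in zip(docs, p_zd)
    ]


docs = ["fruit salad recipe", "stock market news", "market prices today"]
# Toy documents vs topics matrix: rows are documents, columns are topics.
p_zd = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.2, 0.8],
]

print(docs_top_topic(docs, p_zd))
# [('fruit salad recipe', 0), ('stock market news', 1), ('market prices today', 1)]
```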