Utility functions
- bitermplus.get_words_freqs(docs: List[str] | ndarray | Series, **kwargs: dict) → Tuple[csr_matrix, ndarray, Dict]
Compute the documents vs words frequency matrix.
- Parameters:
docs (Union[List[str], np.ndarray, Series]) – Documents in any format accepted by sklearn.feature_extraction.text.CountVectorizer().
kwargs (dict) – Keyword arguments passed to sklearn.feature_extraction.text.CountVectorizer().
- Returns:
Documents vs words matrix in CSR format, vocabulary as a numpy.ndarray of terms, and vocabulary as a dictionary of {term: id} pairs.
- Return type:
Tuple[scipy.sparse.csr_matrix, np.ndarray, Dict]
Example
>>> import pandas as pd
>>> import bitermplus as btm
>>> # Loading data
>>> df = pd.read_csv(
...     'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
>>> texts = df['texts'].str.strip().tolist()
>>> # Vectorizing documents and obtaining the full vocabulary
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
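To make the three return values concrete, here is a minimal pure-Python sketch of the same idea. It is not the library's implementation: `words_freqs_sketch` is a hypothetical name, the matrix is a dense list of lists rather than a CSR matrix, and the tokenization only imitates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def words_freqs_sketch(docs):
    """Toy stand-in: dense counts, sorted vocabulary, and a term -> id map."""
    # Lowercase and keep tokens of two or more word characters,
    # imitating CountVectorizer's default token pattern.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

    # Vocabulary sorted alphabetically; a term's id is its position.
    vocabulary = sorted({term for doc in tokenized for term in doc})
    vocab_dict = {term: idx for idx, term in enumerate(vocabulary)}

    # One row per document, one column per vocabulary term.
    counts = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for term, n in Counter(doc).items():
            row[vocab_dict[term]] = n
        counts.append(row)
    return counts, vocabulary, vocab_dict


X, vocabulary, vocab_dict = words_freqs_sketch(
    ["the cat sat", "the cat saw the dog"])
```

Since keyword arguments are forwarded to CountVectorizer, options such as `stop_words='english'` can be supplied in the real call to change which terms end up in the vocabulary.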
- bitermplus.get_vectorized_docs(docs: List[str] | ndarray, vocab: List[str] | ndarray) → List[ndarray]
Replace words with their ids in each document.
- Parameters:
docs (Union[List[str], np.ndarray]) – Documents (iterable of strings).
vocab (Union[List[str], np.ndarray]) – Vocabulary (iterable of terms).
- Returns:
docs – Vectorised documents (list of numpy.ndarray objects with term ids).
- Return type:
List[np.ndarray]
Example
>>> import pandas as pd
>>> import bitermplus as btm
>>> # Loading data
>>> df = pd.read_csv(
...     'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
>>> texts = df['texts'].str.strip().tolist()
>>> # Vectorizing documents and obtaining the full vocabulary
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
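The replacement of words with ids can be pictured with the small sketch below. It is an illustration only: `vectorize_docs_sketch` is a hypothetical name, documents are split on whitespace, plain lists are returned instead of numpy.ndarray objects, and dropping words that are missing from the vocabulary is an assumption rather than documented behaviour.

```python
def vectorize_docs_sketch(docs, vocab):
    """Map each whitespace-separated word to its position in vocab."""
    term_to_id = {term: idx for idx, term in enumerate(vocab)}
    # Assumption: words that are not in the vocabulary are skipped.
    return [
        [term_to_id[word] for word in doc.split() if word in term_to_id]
        for doc in docs
    ]


vocab = ["cat", "dog", "sat", "saw", "the"]
docs_vec = vectorize_docs_sketch(["the cat sat", "the cat saw the bird"], vocab)
```

Here "bird" is absent from the vocabulary, so the second document keeps only four ids.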
- bitermplus.get_biterms(docs: List[ndarray], win: int = 15) → List[List[int]]
Biterms creation routine.
- Parameters:
docs (List[np.ndarray]) – List of numpy.ndarray objects containing word indices.
win (int = 15) – Biterms generation window.
- Returns:
List of biterms for each document.
- Return type:
List[List[int]]
Example
>>> import pandas as pd
>>> import bitermplus as btm
>>> # Loading data
>>> df = pd.read_csv(
...     'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
>>> texts = df['texts'].str.strip().tolist()
>>> # Vectorizing documents, obtaining full vocabulary and biterms
>>> X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
>>> docs_vec = btm.get_vectorized_docs(texts, vocabulary)
>>> biterms = btm.get_biterms(docs_vec)
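A biterm is an unordered pair of words that co-occur in a short context. The sketch below shows one common way to generate such pairs for a single vectorized document. It is not taken from the library: `biterms_sketch` is a hypothetical name, the exact window semantics (each word is paired with up to `win - 1` following words) are an assumption, and pairs are returned as tuples for readability.

```python
def biterms_sketch(doc_ids, win=15):
    """Pair each word id with the ids that follow it inside the window."""
    pairs = []
    n = len(doc_ids)
    for i in range(n - 1):
        # Assumption: the window covers positions i+1 .. i+win-1.
        for j in range(i + 1, min(i + win, n)):
            a, b = doc_ids[i], doc_ids[j]
            # Store the pair in a canonical order, since biterms are unordered.
            pairs.append((min(a, b), max(a, b)))
    return pairs


all_pairs = biterms_sketch([4, 0, 2])
narrow_pairs = biterms_sketch([4, 0, 2], win=2)
```

With the default window every pair in this three-word document is produced, while `win=2` keeps only adjacent words.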
- bitermplus.get_top_topic_words(model: BTM, words_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame
Select top topic words from a fitted model.
- Parameters:
model (bitermplus._btm.BTM) – Fitted BTM model.
words_num (int = 20) – The number of words to select.
topics_idx (Union[List, numpy.ndarray] = None) – Topic indices. Meant to be used to select only stable topics.
- Returns:
Words with the highest probabilities for each selected topic.
- Return type:
DataFrame
Example
>>> stable_topics = [0, 3, 10, 12, 18, 21]
>>> top_words = btm.get_top_topic_words(
...     model,
...     words_num=100,
...     topics_idx=stable_topics)
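The selection itself amounts to ranking words by probability within each chosen topic. The sketch below illustrates this on a hand-made topics vs words matrix. All names are hypothetical (`top_topic_words_sketch`, `p_wz`, the `topic{i}` keys), the probabilities are invented for the example, and a plain dict stands in for the returned DataFrame.

```python
def top_topic_words_sketch(p_wz, vocabulary, words_num=20, topics_idx=None):
    """Return the words_num most probable words for each selected topic."""
    if topics_idx is None:
        topics_idx = range(len(p_wz))
    top = {}
    for t in topics_idx:
        # Rank word ids by their probability in topic t, highest first.
        ranked = sorted(
            range(len(vocabulary)), key=lambda w: p_wz[t][w], reverse=True)
        top[f"topic{t}"] = [vocabulary[w] for w in ranked[:words_num]]
    return top


vocabulary = ["cat", "dog", "sat", "saw", "the"]
# Rows are topics, columns are words (made-up values).
p_wz = [
    [0.40, 0.05, 0.30, 0.05, 0.20],
    [0.05, 0.50, 0.05, 0.30, 0.10],
]
top_words = top_topic_words_sketch(p_wz, vocabulary, words_num=2, topics_idx=[1])
```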
- bitermplus.get_top_topic_docs(docs: Sequence[Any], p_zd: ndarray, docs_num: int = 20, topics_idx: Sequence[Any] = None) → DataFrame
Select top topic docs from a fitted model.
- Parameters:
docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).
p_zd (np.ndarray) – Documents vs topics probabilities matrix.
docs_num (int = 20) – The number of documents to select.
topics_idx (Sequence[Any] = None) – Topic indices. Meant to be used to select only stable topics.
- Returns:
Documents with the highest probabilities in each selected topic.
- Return type:
DataFrame
Example
>>> top_docs = btm.get_top_topic_docs(
...     texts,
...     p_zd,
...     docs_num=100,
...     topics_idx=[1,2,3,4])
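As a rough illustration of the ranking involved, the sketch below orders documents by their probability in each selected topic, reading down the columns of a documents vs topics matrix. The name `top_topic_docs_sketch`, the `topic{i}` keys, and the probabilities are assumptions made for the example, and a dict of lists stands in for the returned DataFrame.

```python
def top_topic_docs_sketch(docs, p_zd, docs_num=20, topics_idx=None):
    """Return the docs_num most probable documents for each selected topic."""
    if topics_idx is None:
        topics_idx = range(len(p_zd[0]))
    top = {}
    for t in topics_idx:
        # Rank document positions by the probability of topic t, highest first.
        ranked = sorted(
            range(len(docs)), key=lambda d: p_zd[d][t], reverse=True)
        top[f"topic{t}"] = [docs[d] for d in ranked[:docs_num]]
    return top


docs = ["doc a", "doc b", "doc c"]
# Rows are documents, columns are topics (made-up values).
p_zd = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
top_docs = top_topic_docs_sketch(docs, p_zd, docs_num=2)
```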
- bitermplus.get_docs_top_topic(docs: Sequence[Any], p_zd: ndarray) → DataFrame
Select most probable topic for each document.
- Parameters:
docs (Sequence[Any]) – Iterable of documents (e.g. list of strings).
p_zd (np.ndarray) – Documents vs topics probabilities matrix.
- Returns:
Documents and the most probable topic for each of them.
- Return type:
DataFrame
Example
>>> import bitermplus as btm
>>> # Read documents from file
>>> # texts = ...
>>> # Build and train a model
>>> # model = ...
>>> # model.fit(...)
>>> btm.get_docs_top_topic(texts, model.matrix_docs_topics_)
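Picking the most probable topic is an argmax over each row of the documents vs topics matrix. The sketch below shows that step in pure Python. `docs_top_topic_sketch` is a hypothetical name, the probabilities are invented, and a list of (document, topic index) pairs stands in for the returned DataFrame.

```python
def docs_top_topic_sketch(docs, p_zd):
    """Pair each document with the index of its most probable topic."""
    result = []
    for doc, row in zip(docs, p_zd):
        # Index of the largest probability in this document's row;
        # on ties, the lowest topic index wins.
        best_topic = max(range(len(row)), key=lambda t: row[t])
        result.append((doc, best_topic))
    return result


docs = ["doc a", "doc b", "doc c"]
# Rows are documents, columns are topics (made-up values).
p_zd = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
labels = docs_top_topic_sketch(docs, p_zd)
```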