Tutorial
========
Model fitting
-------------
This example demonstrates basic model fitting.
Prerequisite: your documents should be preprocessed (cleaned, lemmatized/stemmed, and stop words removed).
.. code-block:: python
import bitermplus as btm
import numpy as np
import pandas as pd
# Importing data
df = pd.read_csv(
'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
# Vectorize documents and extract biterms
# Uses sklearn's CountVectorizer internally - accepts its parameters
# Example: stop word removal
stop_words = ["word1", "word2", "word3"]
X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
# Initializing and running model
model = btm.BTM(
X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
Inference
---------
Calculate document-topic probability matrix (inference):
.. code-block:: python
p_zd = model.transform(docs_vec)
For inference on new documents, vectorize using the training vocabulary:
.. code-block:: python
new_docs_vec = btm.get_vectorized_docs(new_texts, vocabulary)
p_zd = model.transform(new_docs_vec)
Calculating metrics
-------------------
Calculate perplexity using the document-topic probability matrix (``p_zd``) from inference:
.. code-block:: python
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_
Visualizing results
-------------------
Visualize results using the `tmplot `_ package:
.. code-block:: python
import tmplot as tmp
# Run the interactive report interface
tmp.report(model=model, docs=texts)
Filtering stable topics
-----------------------
Topic models suffer from instability across runs [1]_ [2]_ [3]_.
The ``tmplot`` package provides methods to identify stable topics using
distance metrics: Kullback-Leibler divergence, Hellinger distance, Jeffrey's divergence,
Jensen-Shannon divergence, Jaccard index, Bhattacharyya distance, and Total variation distance.
.. code-block:: python
import pickle as pkl
import tmplot as tmp
import glob
# Loading saved models
models_files = sorted(glob.glob(r'results/model[0-9].pkl'))
models = []
for fn in models_files:
file = open(fn, 'rb')
models.append(pkl.load(file))
file.close()
# Choosing reference model
np.random.seed(122334)
reference_model = np.random.randint(1, 6)
# Getting close topics
close_topics, close_kl = tmp.get_closest_topics(
models, method="sklb", ref=reference_model)
# Getting stable topics
stable_topics, stable_kl = tmp.get_stable_topics(
close_topics, close_kl, ref=reference_model, thres=0.7)
# Stable topics indices list
print(stable_topics[:, reference_model])
Model loading and saving
------------------------
Models support `pickle `_ serialization (since v0.5.3):
.. code-block:: python
import pickle as pkl
# Saving
with open("model.pkl", "wb") as file:
pkl.dump(model, file)
# Loading
with open("model.pkl", "rb") as file:
model = pkl.load(file)
References
----------
.. [1] Koltcov, S., Koltsova, O., & Nikolenko, S. (2014, June).
Latent dirichlet allocation: stability and applications to studies of
user-generated content. In Proceedings of the 2014 ACM conference on Web
science (pp. 161-165).
.. [2] Mantyla, M. V., Claes, M., & Farooq, U. (2018, October).
Measuring LDA topic stability from clusters of replicated runs. In
Proceedings of the 12th ACM/IEEE international symposium on empirical
software engineering and measurement (pp. 1-4).
.. [3] Greene, D., O’Callaghan, D., & Cunningham, P. (2014, September). How many
topics? stability analysis for topic models. In Joint European conference on
machine learning and knowledge discovery in databases (pp. 498-513). Springer,
Berlin, Heidelberg.