Tutorial
========

Model fitting
-------------

This example demonstrates basic model fitting.
Prerequisite: your documents should be preprocessed (cleaned,
lemmatized/stemmed, and stop words removed).

.. code-block:: python

    import bitermplus as btm
    import numpy as np
    import pandas as pd

    # Importing data
    df = pd.read_csv(
        'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
    texts = df['texts'].str.strip().tolist()

    # Vectorize documents and extract biterms
    # Uses sklearn's CountVectorizer internally - accepts its parameters
    # Example: stop word removal
    stop_words = ["word1", "word2", "word3"]
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    biterms = btm.get_biterms(docs_vec)

    # Initializing and running model
    model = btm.BTM(
        X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
    model.fit(biterms, iterations=20)

Inference
---------

Calculate the document-topic probability matrix (inference):

.. code-block:: python

    p_zd = model.transform(docs_vec)

For inference on new documents, vectorize them using the training vocabulary:

.. code-block:: python

    # new_texts: a list of documents preprocessed like the training texts
    new_docs_vec = btm.get_vectorized_docs(new_texts, vocabulary)
    p_zd = model.transform(new_docs_vec)

Calculating metrics
-------------------

Calculate perplexity and coherence. Perplexity requires the document-topic
probability matrix (``p_zd``) from inference:

.. code-block:: python

    # The last positional argument (8) is the number of topics T
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
    coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
    # or
    perplexity = model.perplexity_
    coherence = model.coherence_

Visualizing results
-------------------

Visualize results using the ``tmplot`` package:

.. code-block:: python

    import tmplot as tmp

    # Run the interactive report interface
    tmp.report(model=model, docs=texts)

Filtering stable topics
-----------------------

Topic models suffer from instability across runs [1]_ [2]_ [3]_.
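
As a rough illustration of how such instability can be quantified, the sketch
below matches topics between two runs by the Jensen-Shannon divergence of
their topic-word distributions. The matrices ``run_a`` and ``run_b`` and the
helper ``js_divergence`` are hypothetical and written for this example only.
They are not part of ``bitermplus`` or ``tmplot``.

.. code-block:: python

    import numpy as np

    def js_divergence(p, q):
        """Jensen-Shannon divergence (log base 2) of two discrete distributions."""
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        # Normalize so that both inputs sum to 1
        p = p / p.sum()
        q = q / q.sum()
        m = 0.5 * (p + q)

        def kl(a, b):
            # Entries with a == 0 contribute nothing to the sum
            mask = a > 0
            return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Toy topic-word matrices (topics x words) from two hypothetical runs
    run_a = np.array([[0.70, 0.20, 0.05, 0.05],
                      [0.05, 0.05, 0.20, 0.70]])
    run_b = np.array([[0.10, 0.05, 0.25, 0.60],
                      [0.65, 0.25, 0.05, 0.05]])

    # Pairwise distances: rows are topics of run_a, columns are topics of run_b
    dist = np.array([[js_divergence(a, b) for b in run_b] for a in run_a])

    # For each topic in run_a, the index of its closest topic in run_b
    closest = dist.argmin(axis=1)
    print(closest)

With base-2 logarithms the divergence lies between 0 and 1. A value near 0
means a topic reappears almost unchanged in the other run, while a value near
1 means it has no close counterpart.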
The ``tmplot`` package provides methods to identify stable topics using
distance metrics: Kullback-Leibler divergence, Hellinger distance,
Jeffrey's divergence, Jensen-Shannon divergence, Jaccard index,
Bhattacharyya distance, and total variation distance.

.. code-block:: python

    import glob
    import pickle as pkl

    import numpy as np
    import tmplot as tmp

    # Loading saved models
    models_files = sorted(glob.glob(r'results/model[0-9].pkl'))
    models = []
    for fn in models_files:
        with open(fn, 'rb') as file:
            models.append(pkl.load(file))

    # Choosing a random reference model (seeded for reproducibility)
    np.random.seed(122334)
    reference_model = np.random.randint(1, 6)

    # Getting close topics
    close_topics, close_kl = tmp.get_closest_topics(
        models, method="sklb", ref=reference_model)

    # Getting stable topics
    stable_topics, stable_kl = tmp.get_stable_topics(
        close_topics, close_kl, ref=reference_model, thres=0.7)

    # Stable topics indices list
    print(stable_topics[:, reference_model])

Model loading and saving
------------------------

Models support ``pickle`` serialization (since v0.5.3):

.. code-block:: python

    import pickle as pkl

    # Saving
    with open("model.pkl", "wb") as file:
        pkl.dump(model, file)

    # Loading
    with open("model.pkl", "rb") as file:
        model = pkl.load(file)

References
----------

.. [1] Koltcov, S., Koltsova, O., & Nikolenko, S. (2014, June). Latent
   Dirichlet allocation: stability and applications to studies of
   user-generated content. In Proceedings of the 2014 ACM conference on Web
   science (pp. 161-165).

.. [2] Mantyla, M. V., Claes, M., & Farooq, U. (2018, October). Measuring LDA
   topic stability from clusters of replicated runs. In Proceedings of the
   12th ACM/IEEE international symposium on empirical software engineering
   and measurement (pp. 1-4).

.. [3] Greene, D., O’Callaghan, D., & Cunningham, P. (2014, September). How
   many topics? Stability analysis for topic models. In Joint European
   conference on machine learning and knowledge discovery in databases
   (pp. 498-513). Springer, Berlin, Heidelberg.