Bitermplus implements the Biterm Topic Model (BTM) for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. It is a Cythonized version of BTM. The package can also compute perplexity and semantic coherence metrics.
Here is a simple example of model fitting. It assumes that you have already gone through the preprocessing
stage: cleaned, lemmatized or stemmed your documents, and removed stop words.
import bitermplus as btm
import numpy as np
import pandas as pd

# Importing data
df = pd.read_csv('dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# Vectorizing documents, obtaining full vocabulary and biterms
# Internally, btm.get_words_freqs uses CountVectorizer from sklearn.
# You can pass any of its arguments to btm.get_words_freqs.
# For example, you can remove stop words:
stop_words = ["word1", "word2", "word3"]
X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Initializing and running model
model = btm.BTM(X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
Unsupervised topic models (such as LDA) are subject to topic instability [1][2][3]. The tmplot package provides a special method for selecting stable
topics. It supports various distance metrics: Kullback-Leibler divergence
(symmetric and non-symmetric), Hellinger distance, Jeffrey's divergence,
Jensen-Shannon divergence, Jaccard index, Bhattacharyya distance, and total
variation distance.
import pickle as pkl
import tmplot as tmp
import numpy as np
import glob

# Loading saved models
models_files = sorted(glob.glob(r'results/model[0-9].pkl'))
models = []
for fn in models_files:
    with open(fn, 'rb') as file:
        models.append(pkl.load(file))

# Choosing reference model
np.random.seed(122334)
reference_model = np.random.randint(1, 6)

# Getting close topics
close_topics, close_kl = tmp.get_closest_topics(
    models, method="sklb", ref=reference_model)

# Getting stable topics
stable_topics, stable_kl = tmp.get_stable_topics(
    close_topics, close_kl, ref=reference_model, thres=0.7)

# Stable topics indices list
print(stable_topics[:, reference_model])
This section presents the results of a series of benchmarks run on the SearchSnippets dataset.
Sixteen models were trained with different numbers of iterations
(from 10 to 2000) and default model parameters; the number of topics was set to 8.
Semantic topic coherence (u_mass) and perplexity were
calculated for each model.
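As an illustration of how these metrics can be obtained, here is a minimal sketch. It assumes that the helpers btm.perplexity and btm.coherence and the perplexity_/coherence_ model attributes behave as documented for bitermplus, and it reuses X, docs_vec and model from the fitting example above; T=8 matches the topics number used in the benchmarks.

import bitermplus as btm

# Inferring the documents vs topics matrix with the fitted model
p_zd = model.transform(docs_vec)

# Perplexity and semantic coherence (u_mass) for T=8 topics
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)

# The same metrics are also stored as model attributes after fitting
perplexity = model.perplexity_
coherence = model.coherence_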
Run model fitting and return the documents vs topics matrix.
Parameters:
docs (list) – List of documents. Each document must be presented as
a list of word ids. Typically, it is the output of
bitermplus.get_vectorized_docs().
biterms (list) – List of biterms.
infer_type (str) – Inference type. The following options are available:
sum_b (default), sum_w, mix.
iterations (int = 600) – Number of iterations.
verbose (bool = True) – Be verbose (show progress bars).
Infer the documents vs topics probability matrix for an already fitted model.
Parameters:
docs (list) – List of documents. Each document must be presented as
a list of word ids. Typically, it is the output of
bitermplus.get_vectorized_docs().
infer_type (str) – Inference type. The following options are available:
sum_b (default), sum_w, mix.
verbose (bool = True) – Be verbose (show progress bar).
Returns:
p_zd – Documents vs topics probability matrix (D x T).
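For illustration, here is a minimal usage sketch based on the parameter lists above, assuming they describe the model's fit_transform and transform methods and reusing docs_vec and biterms from the fitting example.

# Fit the model and infer the documents vs topics matrix in one call
p_zd = model.fit_transform(docs_vec, biterms, infer_type='sum_b', iterations=20)

# Or run inference alone with an already fitted model
p_zd = model.transform(docs_vec, infer_type='sum_b')

# p_zd has shape (number of documents, number of topics)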
Renyi entropy can be used to estimate the optimal number of topics: fit
several models with different numbers of topics and choose the number of
topics for which the Renyi entropy is the lowest (see the sketch after the example below).
Parameters:
p_wz (np.ndarray) – Topics vs words probabilities matrix (T x W).
max_probs (bool) – Use maximum probabilities of terms per topic instead of all probability values.
Returns:
renyi (double) – Renyi entropy value.
Example
>>> import bitermplus as btm
>>> # Preprocessing step
>>> # ...
>>> # Model fitting step
>>> # model = ...
>>> # Entropy calculation
>>> entropy = btm.entropy(model.matrix_topics_words_)
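Building on the example above, here is a sketch of selecting the number of topics by Renyi entropy. It reuses X, vocabulary and biterms from the fitting example; the range of topic numbers tried is an arbitrary assumption.

import bitermplus as btm

# Fit several models with different numbers of topics
# and keep the Renyi entropy of each one
entropies = {}
for T in (5, 10, 15, 20):
    model = btm.BTM(X, vocabulary, seed=12321, T=T, M=20, alpha=50/T, beta=0.01)
    model.fit(biterms, iterations=20)
    entropies[T] = btm.entropy(model.matrix_topics_words_)

# Choose the number of topics with the lowest entropy
optimal_T = min(entropies, key=entropies.get)
print(optimal_T, entropies[optimal_T])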