Model
- class bitermplus.BTM(n_dw, vocabulary, int T, int M=20, double alpha=1., double beta=0.01, unsigned int seed=0, int win=15, bool has_background=False)
Biterm Topic Model.
- Parameters:
n_dw (csr.csr_matrix) – Documents vs words frequency matrix. Typically, it should be the output of CountVectorizer from sklearn package.
vocabulary (list) – Vocabulary (a list of words).
T (int) – Number of topics.
M (int = 20) – Number of top words for coherence calculation.
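alpha (float = 1) – Model parameter (symmetric Dirichlet prior on the topic distribution).
beta (float = 0.01) – Model parameter (symmetric Dirichlet prior on the topic-word distributions).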
seed (int = 0) – Random state seed. If seed is equal to 0 (default), use time(NULL).
win (int = 15) – Biterms generation window.
has_background (bool = False) – Use a background topic to accumulate highly frequent words.
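To make the win parameter concrete, the sketch below generates biterms (unordered pairs of word ids) from one vectorized document in pure Python. The helper name make_biterms and the exact window boundary are assumptions made for illustration; the library's own biterm generator may differ in detail.

```python
def make_biterms(doc, win=15):
    """Return unordered word-id pairs whose positions are fewer than `win` apart.

    Hypothetical helper for illustration only; not part of bitermplus.
    """
    biterms = []
    for i in range(len(doc) - 1):
        # Pair position i with the following positions inside the window.
        for j in range(i + 1, min(i + win, len(doc))):
            biterms.append((min(doc[i], doc[j]), max(doc[i], doc[j])))
    return biterms


doc = [3, 1, 4, 1]                 # a document as a list of word ids
print(make_biterms(doc, win=2))    # only adjacent words are paired
print(make_biterms(doc, win=15))   # every pair in this short document
```

A smaller window yields fewer, more local biterms; for short texts the default of 15 usually pairs every word with every other word in the document.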
- alpha_
Model parameter.
- Type:
float
- biterms_
Model biterms. Terms are coded with the corresponding ids.
- Type:
np.ndarray
- coherence_
Semantic topics coherence.
- Type:
np.ndarray
- coherence_window_
Number of top words for coherence calculation.
- Type:
int
- df_words_topics_
Words vs topics probabilities in a DataFrame.
- Type:
DataFrame
- fit(self, list Bs, int iterations=600, bool verbose=True)
Biterm topic model fitting method.
- Parameters:
Bs (list) – Biterms list.
iterations (int = 600) – Iterations number.
verbose (bool = True) – Show progress bar.
- fit_transform(self, docs, list biterms, unicode infer_type=u'sum_b', int iterations=600, bool verbose=True)
Run model fitting and return documents vs topics matrix.
- Parameters:
docs (list) – Documents list. Each document must be presented as a list of words ids. Typically, it can be the output of bitermplus.get_vectorized_docs().
biterms (list) – List of biterms.
infer_type (str) – Inference type. The following options are available: sum_b (default), sum_w, mix.
iterations (int = 600) – Iterations number.
verbose (bool = True) – Be verbose (show progress bars).
- Returns:
p_zd – Documents vs topics matrix (D x T).
- Return type:
np.ndarray
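A minimal end-to-end sketch of fit_transform, assuming the bitermplus package is installed. The helpers get_words_freqs and get_biterms are taken from the library's README rather than from this section, so treat their names and return values as assumptions. The wrapper name fit_btm_topics and its defaults are purely illustrative.

```python
def fit_btm_topics(texts, n_topics, iterations=20, seed=12321):
    """Fit a BTM on raw texts and return (model, documents-vs-topics matrix)."""
    # Imported lazily so that defining this function does not require the package.
    import bitermplus as btm

    # Assumed helper (from the library README): frequency matrix and vocabulary.
    n_dw, vocabulary, _ = btm.get_words_freqs(texts)
    # Documents as lists of word ids, as fit_transform expects.
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Assumed helper (from the library README): biterms for every document.
    biterms = btm.get_biterms(docs_vec)

    model = btm.BTM(n_dw, vocabulary, T=n_topics, M=20,
                    alpha=50 / n_topics, beta=0.01, seed=seed)
    p_zd = model.fit_transform(docs_vec, biterms, infer_type="sum_b",
                               iterations=iterations, verbose=False)
    return model, p_zd
```

Called as model, p_zd = fit_btm_topics(texts, n_topics=8) on a list of short texts, this should return a p_zd array with one row per document and one column per topic.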
- has_background_
Specifies whether the model has a background topic to accumulate highly frequent words.
- Type:
bool
- iterations_
Number of iterations the model fitting process has gone through.
- Type:
int
- labels_
Model document labels (most probable topic for each document).
- Type:
np.ndarray
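Going by the description of labels_, the labels can be pictured as a row-wise argmax over a documents vs topics matrix. The NumPy sketch below uses made-up probabilities and reflects an assumption about the computation, not a statement about the library's internals.

```python
import numpy as np

# Toy documents-vs-topics matrix (3 documents, 2 topics); each row sums to 1.
p_zd = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.5]])

# Most probable topic per document; ties resolve to the lowest topic index.
labels = p_zd.argmax(axis=1)
print(labels)  # [0 1 0]
```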
- matrix_docs_topics_
Documents vs topics probabilities matrix.
- Type:
np.ndarray
- matrix_topics_docs_
Topics vs documents probabilities matrix.
- Type:
np.ndarray
- matrix_topics_words_
Topics vs words probabilities matrix.
- Type:
np.ndarray
- matrix_words_topics_
Words vs topics probabilities matrix.
- Type:
np.ndarray
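A common use of a topics vs words matrix is listing the most probable words for each topic. The sketch below does this with NumPy on a toy matrix; the helper top_words, the vocabulary and the probabilities are all hypothetical, and with a fitted model one would presumably pass matrix_topics_words_ and vocabulary_ instead.

```python
import numpy as np

vocabulary = np.array(["apple", "bank", "cash", "pear"])
# Toy topics-vs-words matrix (2 topics, 4 words); each row sums to 1.
p_wz = np.array([[0.50, 0.05, 0.05, 0.40],
                 [0.05, 0.45, 0.45, 0.05]])


def top_words(p_wz, vocabulary, n=2):
    """Return the n most probable words for each topic (hypothetical helper)."""
    # Sort word indices by descending probability within each topic row.
    order = np.argsort(-p_wz, axis=1, kind="stable")[:, :n]
    return vocabulary[order]


print(top_words(p_wz, vocabulary))
```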
- perplexity_
Perplexity. Run the transform method before calculating perplexity.
- Type:
float
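For orientation, the sketch below computes perplexity using the standard topic-model definition, exp(-sum(n_dw * log p(w|d)) / sum(n_dw)) with p(w|d) = sum over topics of p(z|d) p(w|z). The function name perplexity_sketch and the dense toy inputs are assumptions; whether the library's value matches this term by term is not confirmed by this section.

```python
import numpy as np


def perplexity_sketch(p_wz, p_zd, n_dw):
    """Perplexity from topics-vs-words, documents-vs-topics and count matrices."""
    p_wd = p_zd @ p_wz            # (D x T) @ (T x W) -> (D x W) word probabilities
    mask = n_dw > 0               # only observed words contribute
    log_lik = np.sum(n_dw[mask] * np.log(p_wd[mask]))
    return float(np.exp(-log_lik / n_dw.sum()))


# Toy check: every word has probability 0.25 in every document,
# so the perplexity is (up to rounding) the vocabulary size, 4.
p_wz = np.full((2, 4), 0.25)
p_zd = np.array([[0.5, 0.5], [0.1, 0.9]])
n_dw = np.array([[1, 0, 2, 0], [0, 3, 0, 1]])
print(perplexity_sketch(p_wz, p_zd, n_dw))
```

The documents vs topics matrix is an input here, which is consistent with the requirement to run transform first.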
- theta_
Topics probabilities vector.
- Type:
np.ndarray
- topics_num_
Number of topics.
- Type:
int
- transform(self, list docs, unicode infer_type=u'sum_b', bool verbose=True)
Return documents vs topics probability matrix.
- Parameters:
docs (list) – Documents list. Each document must be presented as a list of words ids. Typically, it can be the output of bitermplus.get_vectorized_docs().
infer_type (str) – Inference type. The following options are available: sum_b (default), sum_w, mix.
verbose (bool = True) – Be verbose (show progress bar).
- Returns:
p_zd – Documents vs topics probability matrix (D x T).
- Return type:
np.ndarray
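As a rough picture of what the sum_b inference type refers to, the sketch below follows the original Biterm Topic Model formulation, P(z|d) = sum over biterms b of P(z|b) P(b|d), with P(z|b) proportional to theta_z * P(w1|z) * P(w2|z). The function name infer_sum_b and the toy inputs are hypothetical, and whether the library implements sum_b exactly this way is an assumption.

```python
import numpy as np


def infer_sum_b(doc_biterms, theta, p_wz):
    """Topic distribution of one document from its (non-empty) list of biterms."""
    p_zd = np.zeros(len(theta))
    for w1, w2 in doc_biterms:
        # P(z|b) is proportional to theta_z * P(w1|z) * P(w2|z).
        p_zb = theta * p_wz[:, w1] * p_wz[:, w2]
        p_zd += p_zb / p_zb.sum()
    # Each biterm occurrence gets equal weight, i.e. P(b|d) = n_d(b) / total.
    return p_zd / len(doc_biterms)


theta = np.array([0.5, 0.5])              # topics probabilities vector
p_wz = np.array([[0.7, 0.2, 0.1],         # topics-vs-words matrix (2 x 3)
                 [0.1, 0.2, 0.7]])
print(infer_sum_b([(0, 1), (0, 0)], theta, p_wz))
```

In this toy case word 0 is far more likely under topic 0, so the document's distribution leans heavily toward topic 0.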
- vocabulary_
Vocabulary (list of words).
- Type:
np.ndarray
- vocabulary_size_
Vocabulary size (number of words).
- Type:
int
- window_
Biterms generation window size.
- Type:
int