
R packages for topic modeling / LDA: only "topicmodels" and "lda"? [closed]


It seems to me that only two R packages are able to perform Latent Dirichlet Allocation:

One is lda, written by Jonathan Chang; the other is topicmodels, from Bettina Grün and Kurt Hornik.

What are the differences between these two packages in terms of performance, implementation details, and extensibility?

Answer:


Implementation: The topicmodels package provides an interface to the GSL C and C++ code for topic models by Blei et al. and Phan et al. Variational EM is used for the former and Gibbs sampling for the latter. See http://www.jstatsoft.org/v40/i13/paper. The package works nicely with the utilities from the tm package.
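For illustration, a minimal sketch of fitting LDA with topicmodels on a tm DocumentTermMatrix using both estimation back-ends; the acq example corpus and k = 10 are just placeholder choices:

    # Fit LDA on a tm DocumentTermMatrix with topicmodels (placeholder corpus and k)
    library(tm)
    library(topicmodels)

    data("acq")                               # small example corpus shipped with tm
    dtm <- DocumentTermMatrix(acq, control = list(removePunctuation = TRUE,
                                                  stopwords = TRUE))

    lda_vem   <- LDA(dtm, k = 10, method = "VEM")     # variational EM (Blei et al.)
    lda_gibbs <- LDA(dtm, k = 10, method = "Gibbs",   # Gibbs sampling (Phan et al.)
                     control = list(iter = 1000))

    terms(lda_vem, 5)     # top 5 terms per topic
    topics(lda_vem)[1:5]  # most likely topic for the first 5 documents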

The lda package uses a collapsed Gibbs sampler for a number of models similar to those in the GSL library. However, it was implemented by the package authors themselves, not by Blei et al. The implementation therefore generally differs from the estimation technique proposed in the original papers introducing these model variants, which usually apply the VEM algorithm. On the other hand, the package offers more functionality than the other package, including some text mining functions.
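A rough sketch of what the lda package's interface looks like (the toy documents and parameter values here are arbitrary); lexicalize() builds the LDA-C style representation the samplers expect:

    # Collapsed Gibbs sampling with the lda package (toy documents, arbitrary settings)
    library(lda)

    docs <- c("topic models find latent themes",
              "gibbs sampling estimates topic models",
              "variational inference is an alternative")
    corpus <- lexicalize(docs, lower = TRUE)

    fit <- lda.collapsed.gibbs.sampler(corpus$documents,
                                       K = 2,               # number of topics
                                       vocab = corpus$vocab,
                                       num.iterations = 100,
                                       alpha = 0.1, eta = 0.1)

    top.topic.words(fit$topics, 5, by.score = TRUE)  # top words per topic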

Extensibility: In terms of extensibility, the topicmodels code can by its nature be extended to interface with other topic model code written in C and C++. The lda package seems to rely more on the authors' specific implementation, but the Gibbs sampler may allow you to specify your own topic model. Also relevant for extensibility: the first is licensed under GPL-2 and the second under LGPL, so the choice may depend on what you need to extend it for (GPL-2 is stricter on the open source aspect, meaning you can't use it in proprietary software).

Performance: I can't help you here, I've only used topicmodels so far.

Conclusion:
Personally, I use topicmodels, as it is well documented (see the JSS paper above) and I trust the authors (Grün also implemented flexmix and Hornik is an R core member).






+1 for topicmodels. @Momo's answer is very comprehensive. I just want to add that topicmodels takes its input as document-term matrices, which can easily be created with the tm package or with Python. The lda package uses a more esoteric form of input (based on Blei's LDA-C) and I have had no luck using the built-in functions to convert a dtm to the lda package's format (the lda documentation is very poor, as Momo notes).
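For reference, a small sketch of the two input formats being contrasted; the text vector is a placeholder, and dtm2ldaformat() is the conversion helper I'm referring to (your mileage may vary, as noted):

    # Build a DocumentTermMatrix with tm, then try the conversion to LDA-C style input
    library(tm)
    library(topicmodels)

    texts  <- c("first example document", "a second short document")  # placeholder
    corpus <- VCorpus(VectorSource(texts))
    dtm    <- DocumentTermMatrix(corpus)   # this is what topicmodels::LDA() accepts

    # Conversion helper for the lda package's input format; how smoothly it works
    # for a given dtm may vary, as noted above
    lda_input <- dtm2ldaformat(dtm, omit_empty = TRUE)
    str(lda_input$documents[[1]])
    lda_input$vocab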

I put some code that starts with raw text, preprocesses it in tm, and runs it through topicmodels (including finding the optimal number of topics beforehand and working with the output) here. It might be useful to someone coming to topic modeling for the first time.
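Since the code itself is only linked, here is a condensed sketch of that kind of pipeline. Assumptions of mine: tm's crude example data stands in for the raw text, the candidate k values are arbitrary, and perplexity on a held-out split serves as the selection criterion (the linked post may do this differently):

    # Raw text -> tm preprocessing -> choose k by held-out perplexity -> final model
    library(tm)
    library(topicmodels)

    data("crude")   # 20 Reuters articles shipped with tm, used here as "raw text"
    raw <- vapply(crude, function(d) paste(content(d), collapse = " "), character(1))

    corpus <- VCorpus(VectorSource(raw))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    dtm <- DocumentTermMatrix(corpus)

    train <- dtm[seq(1, nrow(dtm), by = 2), ]   # crude even/odd split
    test  <- dtm[seq(2, nrow(dtm), by = 2), ]

    ks   <- c(2, 4, 6)
    perp <- sapply(ks, function(k) perplexity(LDA(train, k = k), test))
    best_k <- ks[which.min(perp)]

    final <- LDA(dtm, k = best_k)
    terms(final, 10)   # top 10 terms per topic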





The stm (Structural Topic Model) package from Molly Roberts, Brandon Stewart, and Dustin Tingley is also a great choice. Building on the tm package, it is a general framework for topic modeling with covariate information at the document level.

https://structuraltopicmodel.com/
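To make the covariate idea concrete, a minimal sketch using the gadarian survey data shipped with stm; the prevalence formula and K = 10 are illustrative choices, not recommendations:

    # Structural topic model with document-level covariates (illustrative settings)
    library(stm)

    data("gadarian")   # open-ended survey responses plus respondent covariates
    processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
    prepared  <- prepDocuments(processed$documents, processed$vocab, processed$meta)

    fit <- stm(documents = prepared$documents, vocab = prepared$vocab,
               K = 10, prevalence = ~ treatment + s(pid_rep),
               data = prepared$meta, max.em.its = 50)

    labelTopics(fit, n = 5)   # top words per topic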

The stm package contains a number of methods (grid search) and measures (semantic coherence, residuals, and exclusivity) to determine the number of topics. If you set the number of topics to 0, the model will also determine an optimal number of topics for you.
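A hedged sketch of both routes (the gadarian example data and candidate K values are my own choices; K = 0 requires the spectral initialization):

    # Choosing the number of topics in stm: grid search vs. letting the model decide
    library(stm)

    data("gadarian")
    processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
    prepared  <- prepDocuments(processed$documents, processed$vocab, processed$meta)

    # Route 1: grid search over candidate K with diagnostic measures
    k_search <- searchK(prepared$documents, prepared$vocab, K = c(5, 10, 15))
    plot(k_search)   # held-out likelihood, residuals, semantic coherence, ...

    # Route 2: let the model pick K itself (requires spectral initialization)
    fit_auto <- stm(prepared$documents, prepared$vocab, K = 0,
                    init.type = "Spectral")
    fit_auto$settings$dim$K   # number of topics the algorithm settled on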

The stmBrowser package is an excellent data visualization add-on for visualizing the influence of external variables on topics. See this example in the context of the 2016 presidential debates: http://alexperrier.github.io/stm-visualization/index.html.


I've used all three libraries (topicmodels, lda, stm); not all of them work with n-grams. The topicmodels library is well regarded and also works with n-grams. But when working with unigrams, a practitioner may prefer stm because it provides well-structured output.
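As a sketch of the n-gram point: topicmodels works with whatever DocumentTermMatrix you feed it, so a bigram tokenizer is all that's needed. This uses the tokenizer pattern from the tm FAQ and the crude example corpus; the choice of k is arbitrary:

    # Bigram topics with topicmodels: swap in an n-gram tokenizer when building the dtm
    library(tm)
    library(topicmodels)

    data("crude")
    BigramTokenizer <- function(x) {
      unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse = " "),
             use.names = FALSE)
    }
    dtm_bigram <- DocumentTermMatrix(crude, control = list(tokenize = BigramTokenizer))

    fit_bigram <- LDA(dtm_bigram, k = 5, method = "Gibbs")
    terms(fit_bigram, 5)   # topics are now over bigrams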
