Latent Dirichlet Allocation in R.
Theses / Institute for Statistics and Mathematics, 2.
WU Vienna University of Economics and Business, Vienna.
Topic models are a new research field within the computer sciences information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since then sparked off the development of other topic models for domain-specific purposes.
This thesis focuses on LDA's practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper ``Finding scientific topics'' by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R~package topicmodels by Bettina Grün and Kurt Hornik. The complete process, including extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, as well as presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper, therefore the research by Griffiths/Steyvers can be reproduced. Furthermore, this thesis proves the suitability of the R environment for text mining with LDA. (author's abstract)