Distributed Text Mining in R

Theußl, Stefan and Feinerer, Ingo and Hornik, Kurt ORCID: https://orcid.org/0000-0003-4198-9911 (2011) Distributed Text Mining in R. Research Report Series / Department of Statistics and Mathematics, 107. WU Vienna University of Economics and Business, Vienna.

Abstract

R has recently gained explicit text mining support with the "tm" package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed leads to an increasing computational workload. Fortunately, adequate parallel programming models like MapReduce and its open source implementation Hadoop allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc", which offers a seamless integration between "tm" and Hadoop. Using an application in culturomics, we show that we can efficiently handle data sets of significant size.
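
To illustrate the MapReduce programming model mentioned in the abstract, the following is a minimal single-process sketch in Python of counting term frequencies over corpus chunks. It is not the "tm.plugin.dc" or Hadoop API. The function names (map_phase, shuffle, reduce_phase, term_frequencies) and the two sample documents are hypothetical, chosen only for illustration. A real Hadoop job would run the map and reduce steps on separate nodes, so no single machine has to hold the whole corpus in memory.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit a (term, 1) pair for every token in one document."""
    # Tokenization by whitespace is a simplifying assumption.
    return [(term, 1) for term in text.lower().split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the term)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, counts):
    """Reduce step: sum the partial counts collected for one term."""
    return term, sum(counts)


def term_frequencies(chunks):
    """Run map, shuffle, and reduce in sequence over (doc_id, text) chunks."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in chunks
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(term, counts) for term, counts in grouped.items())


# Hypothetical sample data, not taken from the paper.
corpus_chunks = [
    ("doc1", "text mining in R"),
    ("doc2", "distributed text mining"),
]
freqs = term_frequencies(corpus_chunks)
```

Because each document is mapped independently and each term is reduced independently, both steps can in principle be spread across machines, which is the property that allows corpora larger than main memory to be processed.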

Item Type: Paper
Keywords: text mining / MapReduce / distributed computing / Hadoop
Divisions: Departments > Finance, Accounting and Statistics > Statistics and Mathematics
Version of the Document: Published
Variance from Published Version: None
Depositing User: Stefan Theußl
Date Deposited: 21 Mar 2011 08:20
Last Modified: 24 Oct 2019 13:41
URI: https://epub.wu.ac.at/id/eprint/3034
