The textcat Package for n-Gram Based Text Categorization in R

Feinerer, Ingo and Buchta, Christian and Geiger, Wilhelm and Rauch, Johannes and Mair, Patrick and Hornik, Kurt (2013) The textcat Package for n-Gram Based Text Categorization in R. Journal of Statistical Software, 52 (6). pp. 1-17. ISSN 1548-7660




Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods. (authors' abstract)
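As an illustration of the character n-gram idea mentioned in the abstract, the following is a minimal Python sketch of rank-based profile matching in the spirit of Cavnar and Trenkle (1994). It is not the textcat package's implementation: the function names (`ngram_profile`, `out_of_place`, `identify`), the underscore padding, the n-gram range of 1 to 3, the profile size of 300, and the two toy training sentences are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # assumed padding: mark word boundaries with underscores
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the language whose profile has the smallest out-of-place distance to the text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy training data (hypothetical); real profiles would be built from much larger corpora.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

if __name__ == "__main__":
    print(identify("the dog sleeps in the house", profiles))
```

With profiles this small the result on unseen text is only suggestive; the sketch is meant to show the ranking and distance steps, not to indicate the accuracy reported in the paper.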

Item Type: Article
Keywords: text mining / text categorization / language identification / n-grams / textcat / R
Divisions: Departments > Finance, Accounting and Statistics > Statistics and Mathematics > Hornik
Version of the Document: Published
Depositing User: ePub Administrator
Date Deposited: 09 Oct 2013 12:04
Last Modified: 24 Oct 2019 13:42

