How many data clusters are in the Galaxy data set? Bayesian cluster analysis in action

Grün, Bettina ORCID: https://orcid.org/0000-0001-7265-4773 and Malsiner-Walli, Gertraud and Frühwirth-Schnatter, Sylvia ORCID: https://orcid.org/0000-0003-0516-5552 (2021) How many data clusters are in the Galaxy data set? Bayesian cluster analysis in action. Advances in Data Analysis and Classification. ISSN 1862-5347

Grün2021_Article_HowManyDataClustersAreInTheGal.pdf
Available under License Creative Commons: Attribution 4.0 International (CC BY 4.0).


Abstract

In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach due to the fact that the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and conclusions drawn. The aim of the paper is to address Aitkin’s concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixture of finite mixtures model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.

Item Type: Article
Additional Information: The authors gratefully acknowledge support from the Austrian Science Fund (FWF): P28740, and through the WU Projects grant scheme: IA-27001574.
Keywords: Bayes, Cluster analysis, Galaxy data set, Mixture model, Prior specification
Divisions: Departments > Finance, Accounting and Statistics > Statistics and Mathematics
Version of the Document: Published
Depositing User: Gertraud Novotny
Date Deposited: 13 Sep 2021 09:55
Last Modified: 13 Sep 2021 09:55
Related URLs:
FIDES Link: https://bach.wu.ac.at/d/research/results/101122/
URI: https://epub.wu.ac.at/id/eprint/8276
