Model-based clustering based on sparse finite Gaussian mixtures

Malsiner-Walli, Gertraud and Frühwirth-Schnatter, Sylvia and Grün, Bettina (2016) Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing, 26 (1). pp. 303-324. ISSN 1573-1375

Available under License Creative Commons: Attribution 4.0 International (CC BY 4.0).

In the framework of Bayesian model-based clustering based on a finite mixture of Gaussian distributions, we present a joint approach to estimate the number of mixture components and identify cluster-relevant variables simultaneously as well as to obtain an identified model. Our approach consists in specifying sparse hierarchical priors on the mixture weights and component means. In a deliberately overfitting mixture model the sparse prior on the weights empties superfluous components during MCMC. A straightforward estimator for the true number of components is given by the most frequent number of non-empty components visited during MCMC sampling. Specifying a shrinkage prior, namely the normal gamma prior, on the component means leads to improved parameter estimates as well as identification of cluster-relevant variables. After estimating the mixture model using MCMC methods based on data augmentation and Gibbs sampling, an identified model is obtained by relabeling the MCMC output in the point process representation of the draws. This is performed using K-centroids cluster analysis based on the Mahalanobis distance. We evaluate our proposed strategy in a simulation setup with artificial data and by applying it to benchmark data sets. (authors' abstract)
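The estimator described in the abstract, the most frequent number of non-empty components visited during MCMC sampling, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy allocation draws, and the tie-breaking rule (prefer the smaller number of components) are assumptions made for illustration only. Each draw is taken to be a list holding the component label assigned to every observation in one MCMC iteration.

```python
from collections import Counter


def count_nonempty(allocations):
    """Return how many components have at least one observation in a draw."""
    return len(set(allocations))


def estimate_num_components(allocation_draws):
    """Return the most frequent number of non-empty components over all draws.

    Ties are broken in favor of the smaller number of components; this
    tie-breaking rule is an assumption of the sketch, not part of the abstract.
    """
    counts = Counter(count_nonempty(draw) for draw in allocation_draws)
    highest_frequency = max(counts.values())
    return min(k for k, c in counts.items() if c == highest_frequency)


# Hypothetical allocation draws for six observations (illustrative only).
# Labels may permute between draws (label switching); only the number of
# occupied components matters for this estimator.
draws = [
    [0, 0, 1, 1, 2, 2],  # 3 non-empty components
    [0, 0, 1, 1, 1, 1],  # 2 non-empty components
    [3, 3, 1, 1, 4, 4],  # 3 non-empty components, different labels
    [0, 0, 0, 1, 2, 2],  # 3 non-empty components
    [0, 1, 2, 3, 3, 3],  # 4 non-empty components
]

k_hat = estimate_num_components(draws)
print(k_hat)
```

Because the count depends only on which labels are occupied, the estimate can be computed before any relabeling of the MCMC output.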

Item Type: Article
Additional Information: This article is published with open access at
Keywords: Bayesian mixture model / multivariate Gaussian distribution / Dirichlet prior / normal Gamma prior / Sparse modeling / cluster analysis
Divisions: Departments > Finance, Accounting and Statistics > Statistics and Mathematics
Version of the Document: Published
Depositing User: Gertraud Novotny
Date Deposited: 05 Feb 2016 10:59
Last Modified: 17 Aug 2021 10:39


