Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution

Strobl, Carolin and Boulesteix, Anne-Laure and Zeileis, Achim and Hothorn, Torsten (2006) Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. Research Report Series / Department of Statistics and Mathematics, 40. Department of Statistics and Mathematics, WU Vienna University of Economics and Business, Vienna.

Warning: There is a more recent version of this item available.

Abstract

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research. (author's abstract)
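
The first mechanism named in the abstract, biased variable selection in individual classification trees, can be illustrated with a small simulation. The sketch below is not the authors' code or simulation design, and it uses Python rather than R. It assumes a binary response and four categorical predictors with 2, 4, 10 and 20 categories, none of which carries any information about the response. The function names, sample size and number of replications are all hypothetical choices made for illustration. For each replication it records which predictor offers the largest Gini impurity reduction at the root node.

```python
import random


def gini(pos, n):
    """Gini impurity of a node holding `pos` positives among `n` binary labels."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest Gini impurity reduction over binary splits of a categorical x.

    With a binary response, the best partition of the categories can be found
    by sorting the categories by their share of positives and scanning the
    cut points in that order.
    """
    n = len(y)
    total_pos = sum(y)
    counts = {}  # category -> [number of observations, number of positives]
    for xi, yi in zip(x, y):
        c = counts.setdefault(xi, [0, 0])
        c[0] += 1
        c[1] += yi
    cats = sorted(counts.values(), key=lambda c: c[1] / c[0])
    parent = gini(total_pos, n)
    best = 0.0
    left_n = left_pos = 0
    for cn, cpos in cats[:-1]:
        left_n += cn
        left_pos += cpos
        right_n = n - left_n
        right_pos = total_pos - left_pos
        child = (left_n * gini(left_pos, left_n)
                 + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - child)
    return best


def root_split_frequencies(n_categories=(2, 4, 10, 20), n_obs=120,
                           n_reps=500, seed=1):
    """Share of replications in which each uninformative predictor wins the root split."""
    rng = random.Random(seed)
    wins = [0] * len(n_categories)
    for _ in range(n_reps):
        # Response and predictors are drawn independently, so no predictor is informative.
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        gains = [best_gini_gain([rng.randrange(k) for _ in range(n_obs)], y)
                 for k in n_categories]
        wins[gains.index(max(gains))] += 1
    return [w / n_reps for w in wins]


freqs = root_split_frequencies()
for k, f in zip((2, 4, 10, 20), freqs):
    print(f"{k:2d} categories: selected in {f:.1%} of replications")
```

If selection were unbiased, each predictor would be chosen in roughly a quarter of the replications. Under these assumptions, predictors with more categories tend to be chosen far more often, simply because they offer more candidate splits. The sketch covers only single-split selection; it does not reproduce the bootstrap sampling effect or the alternative implementation described in the abstract.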

Item Type: Paper
Keywords: random forests / variable importance / Gini importance / variable selection bias
Divisions: Departments > Finance, Accounting and Statistics > Statistics and Mathematics
Depositing User: Repository Administrator
Date Deposited: 20 Sep 2006 10:26
Last Modified: 22 Oct 2019 00:40
URI: https://epub.wu.ac.at/id/eprint/1274
