
Adam Zagdański | 17.11.2011 | Tags: biomarker discovery, reproducible results

In this article I would like to draw attention to the reproducibility of results in biomarker discovery, focusing on the statistical perspective of the problem. Admittedly, reproducibility is a desired property of real markers. However, the important relationship between the reproducibility requirement and the stability of statistical feature selection methods is not yet widely known. So, let me try to clear things up a little...

A need for reproducibility in biomarker discovery

Even though the clinical potential of proteomics and metabolomics in biomarker discovery seems to be high, the non-reproducibility of detected (putative) biomarkers remains the main obstacle. Biomarkers identified by different research groups, or even results based on different experiments conducted in the same lab, often differ markedly. Hence, there is an emerging need for efficient statistical methods that address issues of reproducibility and increase our confidence in discovered markers.

Marker reproducibility and feature selection stability

From the statistical perspective, the discovery of biomarkers from high-throughput 'omics' data means searching for the most discriminating features (e.g. features discriminating healthy from disease samples). This task is usually referred to as feature selection (a good review of feature selection techniques in bioinformatics is given in [1]).

Surprisingly, despite the proliferation of feature selection algorithms in biomarker discovery, reproducibility issues have not received enough attention in the literature. Nevertheless, research interest in this topic has been increasing recently (see e.g. [2], [3] or [4]).

Let me first describe the typical feature selection scenario and then point out its limitations.

Perhaps the simplest approach to feature selection is to analyze each feature separately using univariate tools, such as formal statistical tests (e.g. the t-test) or separation measures (e.g. the Area Under the ROC Curve, AUC). However, a clear drawback of univariate methods is that they disregard the multidimensional structure of the analyzed data and hence neglect important relationships between features. Surprisingly, in spite of such limitations, univariate feature selection methods are often the only biomarker discovery tools available in proteomics or metabolomics software solutions.
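To make the univariate approach concrete, here is a minimal sketch in plain Python that scores each feature on its own with Welch's t-statistic and the AUC. The function names (welch_t, auc, rank_features) and the toy data are my own illustrations, not part of any particular software package.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def auc(pos, neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(X, y):
    """Score every column of X separately; return (index, |t|, AUC) sorted by |t|, largest first."""
    scores = []
    for j in range(len(X[0])):
        disease = [row[j] for row, label in zip(X, y) if label == 1]
        healthy = [row[j] for row, label in zip(X, y) if label == 0]
        a = auc(disease, healthy)
        # max(a, 1 - a) treats "higher in disease" and "lower in disease" alike
        scores.append((j, abs(welch_t(disease, healthy)), max(a, 1 - a)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy data (hypothetical): 6 samples x 3 features; only feature 0 separates the classes.
X = [
    [5.1, 1.0, 3.2],
    [4.9, 2.0, 2.9],
    [5.3, 1.5, 3.1],
    [1.2, 1.4, 3.0],
    [0.8, 1.1, 3.3],
    [1.0, 1.9, 2.8],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = disease, 0 = healthy
ranking = rank_features(X, y)
```

On this toy data feature 0 comes out on top with an AUC of 1.0. Note that every feature is scored in isolation, which is exactly the limitation described above: two features that are only informative in combination would both receive poor scores.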

A more suitable alternative is to select a feature subset using a multivariate, model-based approach. In this case, the feature selection method tries to find a small subset of features that results in the most accurate predictive model. Popular multivariate feature selection methods include stepwise feature selection (e.g. forward and backward feature selection) coupled with various classification models (e.g. logistic regression, SVM classifier, etc.).
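As an illustration, forward selection can be sketched in a few lines. To keep the example dependency-free I have swapped in a nearest-centroid classifier scored by leave-one-out accuracy instead of logistic regression or an SVM; the function names and the toy data are assumptions made for this sketch only.

```python
from statistics import mean

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to `features`."""
    correct = 0
    for i in range(len(X)):
        # Class centroids computed without the held-out sample i
        centroids = {}
        for label in set(y):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            centroids[label] = [mean(row[j] for row in rows) for j in features]
        point = [X[i][j] for j in features]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)

def forward_selection(X, y, max_features):
    """Greedily add the feature that improves accuracy most; stop when nothing improves."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scores = {j: loo_accuracy(X, y, selected + [j]) for j in remaining}
        best_j = max(scores, key=scores.get)  # ties go to the lowest feature index
        if scores[best_j] <= best_score:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

# Toy data (hypothetical): only feature 0 separates the two classes.
X = [
    [5.1, 1.0, 3.2],
    [4.9, 2.0, 2.9],
    [5.3, 1.5, 3.1],
    [1.2, 1.4, 3.0],
    [0.8, 1.1, 3.3],
    [1.0, 1.9, 2.8],
]
y = [1, 1, 1, 0, 0, 0]
selected, best_score = forward_selection(X, y, max_features=2)
```

Here the search stops after picking feature 0, because no second feature improves the leave-one-out accuracy any further. The greedy search evaluates subsets rather than single features, which is what distinguishes it from the univariate tools.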

Unfortunately, when applying different feature selection algorithms, even to the same data, one may identify different subsets of candidate biomarkers that achieve comparable accuracy. Results may also depend on the chosen settings of a given feature selection method. Worse still, different random subsets (of similar size) drawn from the original data can lead to different optimal feature subsets too.
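One way to put a number on this effect is to rerun a selector on random subsamples and compare the resulting feature subsets. The sketch below uses the mean pairwise Jaccard index as the stability score; this particular measure, the simple mean-difference selector and the simulated data are all my own assumptions for illustration.

```python
import random
from itertools import combinations
from statistics import mean

def jaccard(a, b):
    """Jaccard index of two feature subsets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(subsets):
    """Mean pairwise Jaccard index; 1.0 means every run selected the same features."""
    return mean(jaccard(a, b) for a, b in combinations(subsets, 2))

def top_k_features(X, y, k):
    """Illustrative selector: the k features with the largest class-mean difference."""
    diffs = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        diffs.append((abs(mean(pos) - mean(neg)), j))
    return [j for _, j in sorted(diffs, reverse=True)[:k]]

def subsample_stability(X, y, k, n_rounds=20, fraction=0.8, seed=0):
    """Select k features on repeated stratified subsamples and score their agreement."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    subsets = []
    for _ in range(n_rounds):
        keep = rng.sample(pos_idx, max(1, int(fraction * len(pos_idx))))
        keep += rng.sample(neg_idx, max(1, int(fraction * len(neg_idx))))
        subsets.append(top_k_features([X[i] for i in keep], [y[i] for i in keep], k))
    return stability(subsets)

# Simulated data (hypothetical): 20 samples, 50 features, only features 0 and 1 informative.
data_rng = random.Random(1)
y = [1] * 10 + [0] * 10
X = [[data_rng.gauss(2.0 * label if j < 2 else 0.0, 1.0) for j in range(50)] for label in y]
score = subsample_stability(X, y, k=5)
```

A score close to 1 indicates that the subsamples agree on the selected features, while a score close to 0 indicates that they hardly overlap at all.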

  • What causes instability in feature selection?
  • How can this problem be alleviated?
  • Are there any algorithms available that are expected to produce stable (robust) feature subsets?

I guess these are the main concerns that come to mind when we face feature selection instability.

So, let's try to shed some light on these questions...

Main causes of instability in feature selection

He and Yu (2010) (see [2]) recently published an interesting review paper devoted to stable feature selection for biomarker discovery. Among the different aspects related to stability in feature selection, the authors point out three main sources of instability:

  1. Algorithm design without considering stability (i.e. ignoring stability in the feature selection algorithm).
  2. The existence of multiple sets of true markers (many highly correlated features may exist and be identified under different settings, or multiple non-correlated sets of real markers may be present).
  3. A small number of samples in high-dimensional data (a relatively small number of samples (n) and a much larger number of features (p)).

As mentioned in [2], among these three sources of instability, the small number of samples seems to be the most difficult one to deal with in biomarker discovery.

Figure 1: Classification of stable feature selection methods

A review of stable feature selection methods

In [2] the authors systematically review the existing methods for stable feature selection in biomarker discovery applications. The summary is organized by the way in which different algorithms handle the different sources of instability. Figure 1 presents the proposed classification of stable feature selection methods.

Thus, all methods currently available for stable feature selection can be divided into four main categories:

  1. Ensemble feature selection methods

    Instead of relying on a single (unstable) feature selector, a committee of feature selectors is built and used to find the optimal feature subset. This can be achieved either by: a) data perturbation (i.e. random samplings of the original data are used to construct different feature selectors) or b) algorithm perturbation (i.e. results obtained from diverse feature selection algorithms are aggregated to find a robust feature subset).

  2. Methods using prior feature relevance

    In this case prior knowledge may be used in the feature selection process, i.e. some features can be assumed to be more relevant than others. Such prior information can be obtained from experts or relevant publications, or it can be derived from related data sets.

  3. Group feature selection methods

    Such methods handle data with highly correlated features. Instead of a single feature, a feature cluster (or feature group) is used as the basic entity to increase the stability of results. The groups of associated features can be identified using either: a) knowledge-driven methods (e.g. methods incorporating pathway information) or b) data-driven methods (e.g. methods based on cluster analysis).

  4. Methods based on sample injection

    These methods try to increase the sample size to address the "large p, small n" problem, i.e. in typical biomarker studies the number of features (p) is usually much larger than the sample size (n). Available sample injection strategies include: a) methods using transductive learning (i.e. methods utilizing the information embedded in the test data) or b) methods using artificial training samples (i.e. generating artificial samples on the basis of the distribution of the original samples).
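To illustrate the first category, here is a minimal sketch of ensemble feature selection by data perturbation: the same base selector is run on many bootstrap resamples, and features are kept according to how often they are selected. The base selector, the function names and the simulated data are hypothetical choices of mine; they are not taken from [2] or from any specific tool.

```python
import random
from collections import Counter
from statistics import mean

def top_k_features(X, y, k):
    """Base selector (illustrative): the k features with the largest class-mean difference."""
    diffs = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        diffs.append((abs(mean(pos) - mean(neg)), j))
    return [j for _, j in sorted(diffs, reverse=True)[:k]]

def ensemble_select(X, y, k, n_rounds=50, seed=0):
    """Run the base selector on bootstrap resamples and aggregate by selection frequency."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    counts = Counter()
    for _ in range(n_rounds):
        # Resample with replacement within each class so both classes are always present
        boot = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        counts.update(top_k_features([X[i] for i in boot], [y[i] for i in boot], k))
    frequency = {j: c / n_rounds for j, c in counts.items()}
    selected = [j for j, _ in counts.most_common(k)]
    return selected, frequency

# Simulated data (hypothetical): 20 samples, 30 features, features 0 and 1 strongly informative.
data_rng = random.Random(42)
y = [1] * 10 + [0] * 10
X = [[data_rng.gauss(6.0 * label if j < 2 else 0.0, 1.0) for j in range(30)] for label in y]
selected, frequency = ensemble_select(X, y, k=2)
```

The selection frequency doubles as a simple confidence score: a feature chosen in nearly every resample is a more credible candidate than one that only shows up occasionally. Algorithm perturbation would follow the same pattern, with different selectors in place of different resamples.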

Which stable feature selection method is the best?

As usual, there is no simple answer to such a question. However, let me recall the main findings given in [2]. According to He and Yu, group feature selection is the most extensively studied of the existing stable feature selection methods. However, with this approach we still need to face the reproducibility issue in the transformed space (i.e. the space of derived feature clusters). Additionally, it is possible that multiple sets of real biomarkers share no correlated features. Taking all these concerns into account, an ensemble feature selection strategy seems to be a more universal solution for handling instability in feature selection.

Figure 2: Multiple Biomarker Detection Analysis in Spectrolyzer: feature ranking
Figure 3: Multiple Biomarker Detection Analysis in Spectrolyzer: model performance plot

Need for software tools yielding reproducible biomarkers

As I mentioned, feature selection stability in biomarker discovery has recently attracted much stronger research interest. However, to my knowledge, such algorithms are not yet available in software tools dedicated to biomarker discovery from 'omics' data. High classification accuracy remains the main (and often the only) criterion used to assess the quality of an identified biomarker subset, while the reproducibility requirement is not taken into account. Thus, when designing our software (Spectrolyzer) we decided to fill this gap.

Specifically, we developed Multiple Biomarker Detection (MBD) analysis, an advanced tool for finding a robust subset of the most discriminative features. MBD analysis is expected to produce results that can be reproduced in new experiments. We believe that this approach increases the chances of new discoveries and may shorten the time and reduce the cost related to clinical application.

MBD uses an ensemble-based approach to selecting features for biomarker discovery, which is similar in spirit to the one described in [4]. The algorithm takes into account that good stability of feature selection is as important as high predictive accuracy, i.e. the method rewards frequently selected features that result in accurate predictive models. Figures 2 and 3 show typical results obtained using MBD analysis, i.e. the derived feature ranking and the estimated performance of the predictive model based on the selected (best) features. More details on MBD analysis can be found in our manual.

Stability in feature selection: What's next?

Undoubtedly, stability in feature selection is a very important and challenging research topic. Since feature selection stability is directly related to marker reproducibility, it should be investigated on both theoretical and practical grounds. Hence, more research effort in this direction is required, and efficient software tools are needed as well.

As mentioned, ensemble feature selection seems to be the most universal solution for handling instability in feature selection. That's why it is the method we implemented in MBD analysis. However, other approaches are worth considering too. For example, in [2] an interesting hybrid strategy was suggested that combines group feature selection with ensemble feature selection, i.e. first perform feature grouping and then use ensemble feature selection in the new feature space.
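The first step of such a hybrid strategy, the feature grouping, could look roughly like the sketch below. It uses a simple greedy, correlation-based grouping and replaces each group by the mean of its members; the grouping rule, the threshold of 0.9 and the function names are my own assumptions, not the procedure proposed in [2].

```python
import math
from statistics import mean

def pearson(u, v):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return cov / norm if norm else 0.0

def group_features(X, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose first member it
    correlates with at r >= threshold; otherwise it starts a new group."""
    columns = [list(col) for col in zip(*X)]
    groups = []
    for j, col in enumerate(columns):
        for group in groups:
            if pearson(columns[group[0]], col) >= threshold:
                group.append(j)
                break
        else:
            groups.append([j])
    return groups

def group_representatives(X, groups):
    """New feature space: one column per group, the mean of its members
    (assumes the features are on comparable scales)."""
    return [[mean(row[j] for j in group) for group in groups] for row in X]

# Toy data (hypothetical): features 0 and 1 are almost perfectly correlated.
X = [
    [1.0, 2.1, 5.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.9, 2.0],
]
groups = group_features(X)
Z = group_representatives(X, groups)
```

On the toy data features 0 and 1 end up in one group and feature 2 in a group of its own, so Z has two columns. An ensemble selector would then be run on Z instead of X, which means it chooses among feature groups rather than among individual, interchangeable features.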

Anyway, our interest in this subject keeps growing, and I hope we will add other stable feature selection algorithms to our software in the near future.

Well, we shall see...

References

[1] Yvan Saeys et al. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics, vol. 23, no. 19, pages 2507–2517.

[2] Zengyou He and Weichuan Yu (2010) Stable feature selection for biomarker discovery. Computational Biology and Chemistry, vol. 34, no. 4.

[3] Thomas Abeel et al. (2010) Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, vol. 26, no. 3, pages 392–398.

[4] Diana Chan, Susan Bridges, and Shane Burgess (2007) An ensemble method for identifying robust features for biomarker discovery. Computational Methods of Feature Selection, ch. 19, pages 377–392. CRC Press.
