Predictions of distant cancer metastasis based on gene signatures are studied intensively to realise precise diagnosis and treatments. Gene selection i.e. feature selection is a cornerstone to both establish accurate predictions and understand underlying pathologies. Here, we developed a simple but robust feature selection method using a correlation-centred approach to select minimal gene sets that have both high predictive and generalisation abilities. A multiple logistic regression model was used to predict 5-year metastases of patients with breast cancer. Gene expression data obtained from tumour samples of lymph node-negative breast cancer patients were randomly split into training and validation data. Our method selected 12 genes using training data and this showed a higher area under the receiver operating characteristic curve of 0.730 compared with 0.579 yielded by previously reported 76 genes. The signature with the predictive model was validated in an independent dataset, and its higher generalization ability was observed. Gene ontology analyses revealed that our method consistently selected genes with identical functions which frequently selected by the 76 genes. Taken together, our method identifies fewer gene sets bearing high predictive abilities, which would be versatile and applicable to predict other factors such as the outcomes of medical treatments and prognoses of other cancer types.
ASJC Scopus subject areas