Saturday, May 11, 2013

A Breast Cancer Survival Modeling Competition

This week, a paper on breast cancer survival modeling that I co-authored was published in PLoS Computational Biology.  Using a pool of patient information that included clinical covariates, gene expression, and copy number variation data for a collection of breast cancers, each of us attempted to build the best predictive model of breast cancer survival that we could.  Our results and code were available in real time on a competition-style leaderboard that updated automatically as we submitted new models.  In the end, we did achieve predictive models using the genetic data that performed statistically significantly better than the clinical-only models.  This is a fairly remarkable feat, since the genetic data was quite noisy.  In fact, while signals abound in the genetic data, adding this data to a clinical model tended to confuse the modeling algorithms and hurt our predictive power as often as it helped.

That said, while our models were statistically significantly better, I don't believe that the difference was of much practical significance.  More worthwhile is knowing which signals actually improved our predictive power.  These may be worth further investigation to better explain the correlations between those features and cancer survival.

I'm not going to regurgitate the paper, which is freely available via the link above, but I do want to highlight a few points that I believe are important and of general interest:

  1. Random Survival Forest is the best out-of-the-box survival model we found.  After testing it on some other data sets, I believe that Random Survival Forest may in fact be your best bet if you are doing survival modeling.
  2. A leaderboard or real-time model evaluation system is an enormous motivator in one's research.  The fact that you can quickly tweak a model and have it evaluated and placed next to your others for comparison takes much of the grind out of research.
  3. Competitions are not an inexpensive alternative to hiring personnel or doing your own research.  Procuring the data, building the evaluation system, advertising the challenge, policing the participants, and evaluating their submissions all require substantial computational overhead, tech support, manpower, and expense.
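
To make point 2 a little more concrete: at its core, a leaderboard just scores every submitted model on the same held-out patients and sorts the results.  The sketch below assumes the concordance index as the metric, which is a common choice for survival models; this post does not say which metric the competition actually used.  The function names, submission names, and toy data are all made up for illustration, and ties in follow-up time are simply skipped to keep things short.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored survival data.

    times:       observed follow-up time for each patient
    events:      1 if death was observed, 0 if the patient was censored
    risk_scores: model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # follow-up ended in an observed event rather than censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # model ranked the earlier death as riskier
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def leaderboard(times, events, submissions):
    """Rank named submissions by concordance index, best first."""
    scores = {name: concordance_index(times, events, risk)
              for name, risk in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy held-out set (hypothetical): four patients, one of them censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_submissions = {
    "clinical_only": [1, 2, 3, 4],      # ranks patients in the wrong order
    "clinical_plus_genes": [4, 3, 2, 1],  # ranks patients in the right order
}

if __name__ == "__main__":
    for name, score in leaderboard(toy_times, toy_events, toy_submissions):
        print(f"{name}: {score:.3f}")
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of comparable patient pairs, so new submissions can be dropped into the dictionary and compared side by side.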
