Tuesday, July 17, 2012

Sage/DREAM Breast Cancer Challenge Launch

This morning's webinar marked the public launch of the Sage/DREAM Breast Cancer Challenge.  The goal is simple:  given a large data set consisting of clinical covariates, copy number data, expression data, and survival data, can you produce the best predictions of survival time for new patients?  The unstated goal is a little more subtle:  so far, I have seen no compelling evidence that molecular-scale information such as gene expression or copy number data adds any predictive power for patient-level outcomes such as survival time.  We hope that this competition will encourage people to build better models of disease collaboratively, and produce some of the first evidence that gene expression and copy number data can actually be useful in predicting patient prognosis.  As an example, I will produce an entry that you can feel free to cannibalize, and will write more about it as I evolve it into a better model of disease.  As a Sage employee, I am not eligible to win, so I hope you will modify and adapt this code and produce your own entry, superior to my own!

Before trying my code, you will want to check out the Breast Cancer Competition Getting Started Guide.

From an internal competition we ran before this launch, we learned that the R randomSurvivalForest package produced the best models of survival given the data, and that the most critical part of the model was which data is chosen for inclusion and how it is pre-processed before being handed over to randomSurvivalForest.

First, I will produce the core of my submission, the model class file.  I will take only the clinical covariates and totally ignore the copy number and expression data, though I will leave a space where they can be included later.  The function in this file that transforms the raw data before it reaches the forest is the most important part of the entire competition:  victory will hinge not on the best model, but instead on cleaning up the data in the best way possible.
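
To make the idea of a clinical-only transformation concrete, here is a minimal sketch in Python.  It is not the code from my model class file (that is R, linked below), and the column names are hypothetical, chosen only for illustration.  It imputes missing numeric values with the median, expands categorical covariates into indicator columns, and accepts expression and copy number arguments that it deliberately ignores.

```python
from statistics import median

# Hypothetical clinical column names, chosen only for illustration.
NUMERIC_COLUMNS = ["age_at_diagnosis", "tumor_size_mm", "lymph_nodes_positive"]
CATEGORICAL_COLUMNS = ["grade", "er_status"]


def transform_clinical(records, expression=None, copy_number=None):
    """Turn raw clinical records (a list of dicts) into numeric feature rows.

    expression and copy_number are accepted but ignored, leaving a slot
    where molecular features could be added later.
    """
    # Median of the observed values, used to fill in missing numeric entries.
    medians = {}
    for col in NUMERIC_COLUMNS:
        observed = [float(r[col]) for r in records if r.get(col) is not None]
        medians[col] = median(observed) if observed else 0.0

    # One indicator column per observed category level, in sorted order.
    levels = {
        col: sorted({str(r[col]) for r in records if r.get(col) is not None})
        for col in CATEGORICAL_COLUMNS
    }

    feature_names = list(NUMERIC_COLUMNS)
    for col in CATEGORICAL_COLUMNS:
        feature_names += [f"{col}={level}" for level in levels[col]]

    rows = []
    for r in records:
        row = [
            float(r[col]) if r.get(col) is not None else medians[col]
            for col in NUMERIC_COLUMNS
        ]
        for col in CATEGORICAL_COLUMNS:
            value = r.get(col)
            # A missing category leaves all of its indicator columns at zero.
            row += [
                1.0 if value is not None and str(value) == level else 0.0
                for level in levels[col]
            ]
        rows.append(row)
    return feature_names, rows


# Made-up patients, only to show the shape of the input and output.
patients = [
    {"age_at_diagnosis": 61, "tumor_size_mm": 22, "lymph_nodes_positive": 1,
     "grade": 2, "er_status": "pos"},
    {"age_at_diagnosis": 48, "tumor_size_mm": None, "lymph_nodes_positive": 0,
     "grade": 3, "er_status": "neg"},
    {"age_at_diagnosis": 70, "tumor_size_mm": 30, "lymph_nodes_positive": 4,
     "grade": None, "er_status": "pos"},
]
names, rows = transform_clinical(patients)
```

The choices made here (median imputation, all-zero indicators for a missing category) are only one option; swapping them out is exactly the kind of change I expect to matter most.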

View my Model Class File on github.

Next, I will train and submit the model.  This code is relatively straightforward, but if you are going to modify my model for your own purposes, you will want to make sure that you submit to the public leaderboard instead of the Sage leaderboard!

View my Submission File on github.

As you can see, my Random Survival Forest Clinical-Only model did pretty well!  On the Sage employees leaderboard, model "Sauerwine RSFModel test 3" got a respectable test score of 0.71.  In my next post, we'll see how to improve that further.  At the time of this post, it actually does better than any model on the Sage/DREAM Public Leaderboard, but since you're free to take my model and improve on it, that's not likely to be the case for long!
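
I have not said above what the score measures.  Survival predictions are commonly scored with a concordance index, so, assuming that is what the leaderboard reports, here is a minimal Python sketch of how such a score can be computed.  The function name and the toy numbers are mine, for illustration only, and the leaderboard's exact handling of ties and censoring may differ.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the model ranks correctly.

    times: observed follow-up times
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: higher score means shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four patients, the second one censored at time 10.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.2, 0.4, 0.5]
score = concordance_index(times, events, risk)  # 3 of 4 comparable pairs: 0.75
```

Under this definition a model that assigns every patient the same risk scores 0.5, and a perfect ranking scores 1.0.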

Good luck, and happy modeling!

2 comments:

  1. Hi Ben!
    Thanks much for posting your code. I also wanted to use the Random Forest method in the SAGE competition, but I'm not sure how to deal with the expression dataset and the copy number dataset. What I'm doing now is just some clustering on expression levels. Do you think it's relevant to apply RSF then? Thanks much!

    1. Hi Svetlana--

      Getting competitors to include the expression and copy number features in a model and achieve improved predictive power is actually the unstated goal of those behind the competition! I'd planned on submitting a sample model that did just that, but it proved more difficult than I bargained for, and I was unable to gain anything by including expression or copy number in my RSF model.

      As I'm sure you saw from your clustering, there is definitely some signal there, but it's relatively weak and shrouded in noise. I've heard some suggestions that might be able to tease this signal out and produce a metric that would improve the RSF model, including a Genomic Instability Index and R0 maps, but whether those are fruitful remains to be seen.

      Good luck in the competition!
