Sunday, March 25, 2012

Systems Problem, Systemic Solution

I had the privilege of attending the Mathematics of ICBP conference in Tampa last week, as well as a meeting about an upcoming non-profit start-up whose goal is to find new treatments using a very high-throughput testing apparatus.  The most important lesson from these meetings is that cancer is a systems biology problem: it spans many scales, touches multiple disciplines, and its ultimate solution will require the marriage of theory and experiment.

Indeed, from my work I am learning that the models we use--random forest, elastic net, linear regression--are all actually quite good and ultimately similar in their predictive power.  Instead, we suffer from having few data points relative to the number of parameters we might be interested in.  That is, while we may have dozens of clinical features and thousands of copy number variations and gene expression levels, we have knowledge of very few actual patients.  It gets worse, though: there are concerns as to whether some of the data we have even reflects the "real" cancer.  Below, I list the major issues and open problems in understanding cancer.

  • We do not understand how to link the microscopic behavior of cancer to its macroscopic behavior.  That is, while we may gather a great deal of copy number and expression data, we don't fully understand how that connects to the patient-scale phenotypes:  metastasis, recurrence, and prognosis.
  • Because of their wild growth rate and minimal restrictions on DNA mutations, tumors exhibit surprising heterogeneity.  This heterogeneity allows the cancer to evolve resistance to whatever drugs are applied to it through natural selection.  How can we study these heterogeneous sub-populations of tumor cells?
  • Some cancers may originate from cancer stem cells, which compose a small fraction of the total tumor.  The effectiveness of a treatment may be gauged by its effect on the relatively benign daughter cells, while the true desired target is the very mobile stem cells.  How can we learn to kill a cell that we may be unable to even isolate for study?
  • When you take a tumor out of a person and put it in culture or into a mouse for study, the nature of the tumor somehow changes.  The cells within the culture become visibly different from their progenitors in the human.  How can we be confident that lessons from mouse and in vitro models of cancer will be applicable to treatments in the human environment?
It's not all problems, though!  I learned about some new and exciting developments in cancer systems biology as well:

  • TCGA is going to release protein expression profiles within the next year!
  • The Gene Expression Commons will soon give an 'absolute' reference point for gene expression data!
  • Dr. Carla Grandori has shown the effectiveness of a high-throughput screening platform in finding new therapies for difficult-to-target cancers.
The way forward as I see it is simple:  this is a systems biology problem, and to tackle it we need more data, open data, and more integration of databases and models from theorists, experimentalists, and clinicians.  Systems biology is complex, but ultimately understandable if we can just put the pieces together.



Wednesday, March 21, 2012

Concordance Index and Cox Modeling

If you find yourself modeling survival with censored data, eventually you will encounter the concordance index and Cox models.  In this example, I will use R with the survcomp package, and in particular the function concordance.index.  The two most important inputs to the concordance.index function are the right-censored observations from experiment and a set of numeric lifetime predictions from your model.  Right-censored data arises in many clinical studies, where patients may drop out or the study may end before an event (e.g., death) occurs.  So, for each patient or observation, we have two pieces of information:  a time, and whether or not an event had occurred by that time.  Being "right-censored" thus means that we may only have a lower bound on the time of an event rather than knowing it exactly.
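To make the encoding concrete, here is a minimal sketch using the survival package (which survcomp builds upon) of three observations, the last of which is right-censored:

## Three patients: events observed at 2 and 5 years for the first two;
## the third was still event-free when the study ended at 4 years
library(survival)
times  <- c(2, 5, 4)
events <- c(TRUE, TRUE, FALSE)  ## FALSE = right-censored
Surv(times, events)
## Prints  2  5  4+  -- the "+" marks the censored lower bound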

In the first example, I use the concordance.index function with a simple predictor to show how the concordance index is affected by noise.  Here, the predictor guesses the actual event time plus some Gaussian noise.  Since concordance.index expects risk scores (such as hazard ratios) rather than times, the most accurate guesses are nearly perfectly anticorrelated with risk, so the concordance index starts near 0; as the noise grows, it approaches its random limit of 0.5.

library(survcomp)
library(foreach)

## EXAMPLE 1 ## Adapted from code by A. Margolin
testObservations <- rexp(100)

## Some standard deviations for the noise to add
testSds <- seq(.01, 10, .01)

myPredictions <- foreach(i = 1:length(testSds)) %do% {
  ## Add the noise--don't allow negative predictions!
  abs(testObservations + rnorm(100, sd = testSds[i]))
}

## Get concordance indices for each standard deviation
testCIndices <- foreach(i = 1:length(testSds)) %do% {
  testCIndex <- concordance.index(myPredictions[[i]],
                                  testObservations,
                                  array(TRUE, 100))
  testCIndex$c.index
}

plot(testSds, unlist(testCIndices))


Plot 1:  Perfect Anticorrelation

In example 2, I use the reciprocal to convert each predicted time to a rate.  Since the ordering of the predictions is now exactly reversed relative to example 1, the concordance index starts at 1 for low noise and drops to 0.5 as the noise grows.


## EXAMPLE 2 ## Perfect Correlation

testCIndices <- foreach(i = 1:length(testSds)) %do% {
  ## The reciprocal converts each predicted time into a rate
  testCIndex <- concordance.index(1 / myPredictions[[i]],
                                  testObservations,
                                  array(TRUE, 100))
  testCIndex$c.index
}

plot(testSds, unlist(testCIndices))


Plot 2:  Perfect Correlation

Suppose I observe three events for patients A, B, and C at times obsA = 1 year, obsB = 2 years, and obsC = 3 years.  If, in arbitrary units, I predicted predA = 1, predB = 2, and predC = 3, then my concordance index is 1 for these predictions.  Alternatively, if I predicted predA = 0.1, predB = 3, and predC = 3.1, then my concordance index is still 1!  Only the order mattered, not the values.  So, if obsA < obsB < obsC and predA < predB < predC, then the concordance index is 1; if instead predC < predB < predA, then my predictions are exactly anticorrelated and the concordance index is zero.  The calculation itself is described in the open-access paper, On Ranking in Survival Analysis: Bounds on the Concordance Index.  The general idea is that you randomly select pairs of observations along with their predictions.  If the smaller observation also has the smaller prediction, we score 2 points; if the predictions are tied, we score 1 point; if the smaller observation has the larger prediction, we score no points.  At the end, we normalize by 2 times the number of pairs whose observations could be compared, and obtain our concordance index.
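To illustrate that scoring rule, here is a naive sketch for fully observed (uncensored) data.  The helper naiveCIndex is hypothetical--it is not the survcomp implementation, which also handles censoring--but it reproduces the point-counting described above for predictions expressed as lifetimes:

## A naive pairwise concordance index for uncensored data, where
## predictions are lifetimes (larger prediction = later expected event)
naiveCIndex <- function(obs, pred) {
  points <- 0
  pairs  <- 0
  n <- length(obs)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (obs[i] == obs[j]) next  ## tied observations are not comparable
      pairs <- pairs + 1
      smaller <- if (obs[i] < obs[j]) i else j
      larger  <- if (obs[i] < obs[j]) j else i
      if (pred[smaller] < pred[larger]) {
        points <- points + 2          ## concordant pair
      } else if (pred[smaller] == pred[larger]) {
        points <- points + 1          ## tied predictions
      }                               ## discordant pair: no points
    }
  }
  points / (2 * pairs)
}

naiveCIndex(c(1, 2, 3), c(1, 2, 3))      ## 1
naiveCIndex(c(1, 2, 3), c(0.1, 3, 3.1))  ## still 1 -- only order matters
naiveCIndex(c(1, 2, 3), c(3, 2, 1))      ## 0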

In example 3, instead of taking the inverse, I simply multiply all of my predictions by -1.  Comparing the result to that from example 2, you will find that they are exactly identical, because only the order matters.  I haven't shown the duplicate plot here, but you may run the R code yourself if you don't believe me!

## EXAMPLE 3 ## Perfect Correlation

testCIndices <- foreach(i = 1:length(testSds)) %do% {
  ## Negation reverses the ordering, just like the reciprocal did
  testCIndex <- concordance.index(-1 * myPredictions[[i]],
                                  testObservations,
                                  array(TRUE, 100))
  testCIndex$c.index
}

plot(testSds, unlist(testCIndices))




This means that the concordance index calculation is quite versatile in terms of its inputs.  If I use a Cox model that outputs hazard ratios, I can feed those into the concordance index calculation directly, since the order of the hazard ratios should be exactly opposite the order of the experimental observations.  Alternatively, if I use a model whose results are expected lifetimes in arbitrary units, that works too, because the order of the predictions should ideally be the same as the order of the observations.  The sign of (result - 0.5) will be opposite for rates versus times, because the order of the rates is the reverse of the order of the times.
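As a quick sanity check of that sign flip, here is a tiny sketch (reusing the survcomp setup from the examples above, with obsT as a made-up set of fully observed event times):

obsT <- c(1, 2, 3, 4)

## Times in the risk slot: the ordering is backwards, so the c-index is 0
concordance.index(obsT, obsT, array(TRUE, 4))$c.index

## Rates in the risk slot: the ordering matches, so the c-index is 1
concordance.index(1 / obsT, obsT, array(TRUE, 4))$c.index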

In the final example, I generate some linear data with random noise and fit it with both a linear model and a Cox model.  Since every data point is an observed event with no censoring, the result of concordance.index is exactly the same for both models.  If censoring were present, the Cox model might have fit the data better.

## EXAMPLE 4 ## Cox and Linear Models are Identical for Non-Censored Data

library(survival)  ## for Surv and coxph

obsX <- seq(0, 9.9, .1)

## Turn the exponential draws from example 1 into a noisy line
testObservations <- testObservations + (1:100) * .1
survObservations <- Surv(testObservations, array(TRUE, 100))

obsFrame  <- data.frame(obsX = obsX, testObservations = testObservations)
## Keep only the covariate in this frame; coxph will find the
## Surv outcome in the workspace
survFrame <- data.frame(obsX = obsX)

## I've made a sparse, noisy line.  I'm so proud of myself!
plot(testObservations)

## Let's fit it with linear and Cox models.
linReg <- lm(testObservations ~ obsX, obsFrame)
coxReg <- coxph(survObservations ~ obsX, data = survFrame)

## And map them back
linPredict <- predict(linReg, data.frame(obsX = obsX))
coxPredict <- predict(coxReg, data.frame(obsX = obsX), type = "risk")

## Well below 0.5, because concordance.index wants rates, not times
concordance.index(x = linPredict,
                  surv.time = survObservations[, "time"],
                  surv.event = survObservations[, "status"],
                  na.rm = TRUE,
                  alpha = .05)$c.index
## Returned [1] 0.09010101 in my test


## Well above 0.5, because negating the times puts them in rate order
concordance.index(x = -1.0 * linPredict,
                  surv.time = survObservations[, "time"],
                  surv.event = survObservations[, "status"],
                  na.rm = TRUE,
                  alpha = .05)$c.index
## Returned [1] 0.909899 in my test


## It's just the same as above!
concordance.index(x = coxPredict,
                  surv.time = survObservations[, "time"],
                  surv.event = survObservations[, "status"],
                  na.rm = TRUE,
                  alpha = .05)$c.index
## Returned [1] 0.909899 in my test



I found the open-access article Hazard Ratio in Clinical Trials useful in understanding the Cox model, and it is definitely worth a look if you intend to do this type of analysis.

Tuesday, March 13, 2012

A Day Off: Deception Pass

I took a personal day yesterday to visit Deception Pass with a friend, and took some pictures for you to enjoy!




What do you call half a clam?  Apparently, a limpet.
Can you spot the world's loneliest parking cone?