Sunday, July 15, 2012

Keep it Simple!

I've updated my CV to reflect some success I've had lately in TopCoder Marathon Matches.  I competed in the Harvard Medical School #3 Competition, where I did dismally, and the NASA Tournament Lab "SynchronousControl" Weekend Competition, where I got a 4th place award.

I learned a lot from these contests, so I'll comment a bit on how they went and how I could have improved.

In the HMS #3 Competition, users were asked to examine simulated gene marker time series data to determine which genes were being most strongly selected for and against.  This boiled down to doing a regression on some data points from a Markov chain, and trying to determine what the parameters of the Markov chain were.  In principle, it was actually pretty straightforward.  Simply returning "parameter = (x1-1)/x0" actually did pretty well--better than my solution!  I had tried several different and more advanced regressions, as well as methods to try to account for uncertainty on the data points.  Ultimately, it turns out that all of this complex math did slightly worse than the obvious solution stated above.

The question itself had some problems, in my mind.  First, they assume that the reading at any given time point was correct and free of noise for the purposes of the next iteration of the Markov chain.  Second, the nature of the problem was such that you could not nominate a gene for being both a "most up-regulated" and "most down-regulated" gene.  Sometimes, the noise turned out to be so extreme that my best guess would have been to nominate a gene for both categories!  Finally, the simulated data was modified so that no gene would ever have a reading of zero, which is not realistic given the nature of the experiment.  So, I'm not entirely certain that the researchers who ran the contest are ultimately going to find that the solution is what they actually wanted in the first place.

What I should have done, in retrospect, is the following:  First, I should have immediately dissected and worked backwards from the data simulation program.  The problem description was not exactly representative of the way the data was being simulated, and understanding this earlier certainly would have affected my progress.  Simulating a lot of my own data for my own test harness also would have been instrumental.  Second, I found that very complex methods are actually not all that much better than very simple methods.  I should have taken all of the simple models I tried and compared them in parallel to simulated data and understood the domains in which each model worked the best, then attempted to identify the domain that each data set was in and used the best simple model per task.

So, I say that I did "dismally" because the extremely complex model I settled on was embarrassingly worse than the simplest possible model.

To be fair, the NTL competition that I got a 4th place award in didn't go a lot better.  The problem was to determine what moves to make to help a bunch of robots escape from a maze, under the condition that all of the robots must make the same set of moves!  The principal challenge was that the program to find these moves had to complete in under 2 seconds, which left time for just one A* or BFS in the largest possible map.  I spent way too much time working on a solution that I had to discard because it performed more searches than could be conducted in 2 seconds.

If I had realized this earlier, I would have focused on a heuristic to help robots cooperatively reach their goals instead of focusing on ways to greedily get each robot out individually in the best time.

The algorithm I ultimately used was this:  "Identify the closest robot to any goal.  Perform the minimum number of moves to get that robot out."  This crude method was sufficient to get a 4th place award in my room, $50.  What I would have done if I had to do it again would be this:  "Identify the closest robot to any goal.  Perform the move that minimizes the average distance between all robots and any goal, subject to the constraint that the closest robot must either move closer or stay the same distance away."

The overall lesson was this:  keep it simple.  I often find myself gravitating towards some complex custom algorithm which may either grossly overfit in the case of the HMS competition or may spend way too much processor time grinding away towards a solution when a crude heuristic could have approximated a better solution much more quickly.  When working on these open-ended problems, I need to fight the urge to write a complex algorithm, stay focused on the goal, and home in on the best solution using increasingly better heuristics.


No comments:

Post a Comment