Sunday, July 22, 2012

Washington State Primary Ballot: Who Are They?

I had a new experience yesterday:  I got my primary election ballot in the mail!  Coming from Pennsylvania and having registered non-partisan, I was not allowed to vote in Pennsylvania primaries.  I always thought that was odd:  if you're a political party, wouldn't you care about what the people with no party affiliation thought about your candidates as much as or more than those in your party?  Since people in your party are not likely to change sides, the independents are the people most likely to be swayed to your side in an election!

Anyhow, I was dismayed when I saw the primary ballot for two reasons.  First, I saw that the ballot included such names as "Mike the Mover" and "Goodspaceguy."  To me, these non-names don't inspire much confidence in the candidates, and it turns out their websites didn't either.  Second, when I started searching for candidates, I noted that many didn't even have a website, or if they did, their website didn't include meaningful information about their platform.  I don't want this to be a blog about anything but facts and opinion supported directly by facts, so I will keep this post simple.  What follows is simply a table of candidates on the Washington State Primary Election ballot and, where I could find a website, why you should vote for them in their own words.  I hope you will use this information to make the best-informed decision in the upcoming primary and election, and may the best candidate win!  Please also be aware of the election pamphlet available at the King County website, which I found helpful but incomplete.

For United States Senator
Name | Party | Why you should vote for
Will Baker | Reform | No Website Found
Timmy (Doc)
Glen (Stocky) R.
Mike the

For United States Representative
Name | Party | Why you should vote for

For Washington State Governor
Name | Party | Why you should vote for
Rob Hill | Democratic | No Website Found
Christian Joubert | No Party
L. Dale
Max Sampson | Republican | No Website Found
Javier O. Lopez | Republican | Website Down

For Washington State Lieutenant Governor
Name | Party | Why you should vote for
James Robert Deal | No Party
Dave T. Sumner, IV | Neopopulist | No Website Found
Mark Greene | Democracy

For Washington State Secretary of State
Name | Party | Why you should vote for
David J Anderson | No Party Preference | No Website Found
Sam Wright | Human

For Washington State Treasurer
Name | Party | Why you should vote for

For Washington State Auditor
Name | Party | Why you should vote for

For Washington State Attorney General
Name | Party | Why you should vote for

For Washington State Commissioner of Public Lands
Name | Party | Why you should vote for
Stephen A Sharon | No Party Preference | No Website Found
Peter J

For Washington State Superintendent of Public Instruction
Name | Why you should vote for
Randy I
Don Hansler | No Website Found
John Patterson
Ronald L (Ron)

For Washington State Insurance Commissioner
Name | Party | Why you should vote for
John R
Brian C Berend | Independent | No Website Found

Tuesday, July 17, 2012

Sage/DREAM Breast Cancer Challenge Launch

This morning's webinar marked the public launch of the Sage/DREAM Breast Cancer Challenge.  The goal is simple:  given a large data set consisting of clinical covariates, copy number data, expression data, and survival data, can you produce the best predictions of survival time for new patients?  The unstated goal is a little bit more subtle:  so far, there is no compelling evidence that I have seen that microscopic information such as gene expression or copy number data adds anything to predictive power in macroscopic behavior such as the survival time of the patient.  We hope that this competition will encourage people to build better models of disease collaboratively, and produce some of the first evidence that gene expression and copy number data can actually be useful in predicting patient prognosis.  As an example, I will produce an entry that you can feel free to cannibalize, and will write more about it as I evolve it into a better model of disease.  As a Sage employee, I am not eligible to win and so I hope you will modify and adapt this code and produce your own entry, superior to my own!

Before trying my code, you will want to check out the Breast Cancer Competition Getting Started Guide.

From an internal competition we ran that preceded this launch, we learned that the R RandomSurvivalForest package produced the best models of survival given the data, and that in fact the most critical part of the model was actually which data is chosen for inclusion and how it is pre-processed before being handed over to RandomSurvivalForest.

First, I will produce the core of my submission, the model class file.  I will take only the clinical covariates and totally ignore the copy number and expression data, though I will leave a space where they can be included later.  In fact, you will find that the function that selects and pre-processes the data is the most important part of the entire competition:  victory will hinge not on the best model, but instead on cleaning up the data in the best way possible.
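To make the idea of a data transformation step concrete, here is a minimal sketch in base R.  It is not the code from my model class file:  the function names, the column names and the choice of median imputation are all assumptions made for illustration.

```r
# Hypothetical pre-processing sketch; names and imputation choices are
# illustrative assumptions, not the competition code.
preprocessClinical <- function(clinical) {
  out <- clinical
  for (name in names(out)) {
    column <- out[[name]]
    if (is.numeric(column)) {
      # Replace missing numeric values with the column median.
      column[is.na(column)] <- median(column, na.rm = TRUE)
    } else {
      # Treat a missing category as its own level.
      column <- as.character(column)
      column[is.na(column)] <- "unknown"
      column <- factor(column)
    }
    out[[name]] <- column
  }
  out
}

buildFeatures <- function(clinical, expression = NULL, copyNumber = NULL) {
  features <- preprocessClinical(clinical)
  # Placeholder: expression and copy number data are accepted but ignored
  # for now, leaving a space where they can be included later.
  features
}

# Toy example with made-up columns.
clinical <- data.frame(age = c(50, NA, 70),
                       grade = c("high", NA, "low"),
                       stringsAsFactors = FALSE)
print(buildFeatures(clinical))
```

Whatever comes out of a function like this is what gets handed to the survival model, so swapping in a different imputation or adding derived columns only requires touching this one place.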

View my Model Class File on github.

Next, I will train and submit the model.  This code is relatively straightforward, but if you are going to modify my model for your own purposes, you will want to make sure that you submit to the public leaderboard instead of the Sage leaderboard!

View my Submission File on github.

As you can see, my Random Survival Forest Clinical-Only model did pretty well!  On the Sage employees leaderboard, model "Sauerwine RSFModel test 3" got a respectable test score of 0.71.  In my next post, we'll see how to improve that further.  At the time of this post, it actually does better than any model on the Sage/DREAM Public Leaderboard, but since you're free to take my model and improve on it, that's not likely to be the case for long!

Good luck, and happy modeling!

Sunday, July 15, 2012

Keep it Simple!

I've updated my CV to reflect some success I've had lately in TopCoder Marathon Matches.  I competed in the Harvard Medical School #3 Competition, where I did dismally, and the NASA Tournament Lab "SynchronousControl" Weekend Competition, where I got a 4th place award.

I learned a lot from these contests, so I'll comment a bit on how they went and how I could have improved.

In the HMS #3 Competition, users were asked to examine simulated gene marker time series data to determine which genes were being most strongly selected for and against.  This boiled down to doing a regression on some data points from a Markov chain, and trying to determine what the parameters of the Markov chain were.  In principle, it was actually pretty straightforward.  Simply returning "parameter = (x1-1)/x0" actually did pretty well--better than my solution!  I had tried several different and more advanced regressions, as well as methods to try to account for uncertainty on the data points.  Ultimately, it turns out that all of this complex math did slightly worse than the obvious solution stated above.

The question itself had some problems, in my mind.  First, the organizers assumed that the reading at any given time point was correct and free of noise for the purposes of the next iteration of the Markov chain.  Second, the nature of the problem was such that you could not nominate a gene for being both a "most up-regulated" and "most down-regulated" gene.  Sometimes, the noise turned out to be so extreme that my best guess would have been to nominate a gene for both categories!  Finally, the simulated data was modified so that no gene would ever have a reading of zero, which is not realistic given the nature of the experiment.  So, I'm not entirely certain that the researchers who ran the contest are ultimately going to find that the solution is what they actually wanted in the first place.

What I should have done, in retrospect, is the following:  First, I should have immediately dissected and worked backwards from the data simulation program.  The problem description was not exactly representative of the way the data was being simulated, and understanding this earlier certainly would have affected my progress.  Simulating a lot of my own data for my own test harness also would have been instrumental.  Second, I found that very complex methods are actually not all that much better than very simple methods.  I should have taken all of the simple models I tried and compared them in parallel to simulated data and understood the domains in which each model worked the best, then attempted to identify the domain that each data set was in and used the best simple model per task.
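The kind of test harness I have in mind can be sketched in a few lines of base R.  Everything here is an assumption made for illustration:  the growth process, the noise level and both estimators are stand-ins, not the contest's simulator or my actual submissions.

```r
# Hypothetical harness: the growth process and both estimators are
# illustrative assumptions, not the contest's actual simulator.
set.seed(1)

simulateSeries <- function(rate, steps = 10, start = 100, noiseSd = 5) {
  truth <- start * (1 + rate)^(0:steps)
  # Add measurement noise and keep readings positive so logs are defined.
  pmax(truth + rnorm(steps + 1, sd = noiseSd), 1)
}

# Simple estimator: relative change over the first step only.
firstStepEstimate <- function(x) (x[2] - x[1]) / x[1]

# Slightly richer estimator: slope of log(reading) against the time index.
logSlopeEstimate <- function(x) {
  tIndex <- seq_along(x) - 1
  exp(unname(coef(lm(log(x) ~ tIndex))[2])) - 1
}

compareEstimators <- function(rate, trials = 200, noiseSd = 5) {
  errors <- replicate(trials, {
    x <- simulateSeries(rate, noiseSd = noiseSd)
    c(firstStep = firstStepEstimate(x) - rate,
      logSlope  = logSlopeEstimate(x) - rate)
  })
  # Root mean squared error per estimator.
  sqrt(rowMeans(errors^2))
}

print(compareEstimators(rate = 0.05))
```

Running the comparison over a grid of rates and noise levels shows which simple model wins in which domain, which is exactly the map I wish I had drawn before reaching for anything complicated.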

So, I say that I did "dismally" because the extremely complex model I settled on was embarrassingly worse than the simplest possible model.

To be fair, the NTL competition that I got a 4th place award in didn't go a lot better.  The problem was to determine what moves to make to help a bunch of robots escape from a maze, under the condition that all of the robots must make the same set of moves!  The principal challenge was that the program to find these moves had to complete in under 2 seconds, which left time for just one A* or BFS in the largest possible map.  I spent way too much time working on a solution that I had to discard because it performed more searches than could be conducted in 2 seconds.

If I had realized this earlier, I would have focused on a heuristic to help robots cooperatively reach their goals instead of focusing on ways to greedily get each robot out individually in the best time.

The algorithm I ultimately used was this:  "Identify the closest robot to any goal.  Perform the minimum number of moves to get that robot out."  This crude method was sufficient to get a 4th place award in my room, $50.  What I would have done if I had to do it again would be this:  "Identify the closest robot to any goal.  Perform the move that minimizes the average distance between all robots and any goal, subject to the constraint that the closest robot must either move closer or stay the same distance away."
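The improved heuristic is easy to sketch once you have a single breadth-first search from the goals.  The code below is a minimal illustration in base R, not my contest entry:  the grid encoding ("#" for walls, "G" for goals), the function names and the rule that a blocked robot stays in place are all assumptions.

```r
# Multi-source breadth-first search: distance from every open cell to its
# nearest goal.  Assumed encoding: "#" is a wall, "G" is a goal.
goalDistances <- function(grid) {
  dist <- matrix(Inf, nrow(grid), ncol(grid))
  queue <- which(grid == "G", arr.ind = TRUE)
  dist[queue] <- 0
  steps <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  head <- 1
  while (head <= nrow(queue)) {
    cell <- queue[head, ]
    head <- head + 1
    for (i in 1:4) {
      nxt <- cell + steps[i, ]
      inside <- nxt[1] >= 1 && nxt[1] <= nrow(grid) &&
                nxt[2] >= 1 && nxt[2] <= ncol(grid)
      if (inside && grid[nxt[1], nxt[2]] != "#" &&
          is.infinite(dist[nxt[1], nxt[2]])) {
        dist[nxt[1], nxt[2]] <- dist[cell[1], cell[2]] + 1
        queue <- rbind(queue, nxt)
      }
    }
  }
  dist
}

# Assumed movement rule: a robot that would hit a wall stays where it is.
moveRobot <- function(grid, pos, step) {
  nxt <- pos + step
  inside <- nxt[1] >= 1 && nxt[1] <= nrow(grid) &&
            nxt[2] >= 1 && nxt[2] <= ncol(grid)
  if (inside && grid[nxt[1], nxt[2]] != "#") nxt else pos
}

# Pick the shared move that minimizes the mean distance to any goal, subject
# to the closest robot never moving farther from its goal.
chooseMove <- function(grid, robots) {
  dist <- goalDistances(grid)
  steps <- list(up = c(-1, 0), down = c(1, 0), left = c(0, -1), right = c(0, 1))
  current <- dist[robots]
  leader <- which.min(current)
  best <- NULL
  bestMean <- Inf
  for (name in names(steps)) {
    moved <- t(apply(robots, 1, function(p) moveRobot(grid, p, steps[[name]])))
    after <- dist[moved]
    if (after[leader] > current[leader]) next
    if (mean(after) < bestMean) {
      bestMean <- mean(after)
      best <- name
    }
  }
  best
}

grid <- rbind(c("#", "#", "#", "#", "#"),
              c("#", ".", ".", "G", "#"),
              c("#", ".", "#", ".", "#"),
              c("#", ".", ".", ".", "#"),
              c("#", "#", "#", "#", "#"))
robots <- rbind(c(2, 2), c(4, 2))  # one (row, column) pair per robot
print(chooseMove(grid, robots))
```

The appeal of this design is that the expensive part, the search, runs once per map, and every subsequent move costs only a handful of table lookups.  The sketch ignores robots that have already exited and returns NULL if some robot cannot reach any goal.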

The overall lesson was this:  keep it simple.  I often find myself gravitating towards some complex custom algorithm which either grossly overfits, as in the HMS competition, or spends way too much processor time grinding away towards a solution when a crude heuristic could have approximated a better solution much more quickly.  When working on these open-ended problems, I need to fight the urge to write a complex algorithm, stay focused on the goal, and home in on the best solution using increasingly better heuristics.

Sunday, July 8, 2012

Vacation: Olympic National Rainforest

For her birthday, I flew my friend Lilli over to Seattle to explore the Olympic National Rainforest.  I'd seen pictures of the idyllic mosses and glaciers near the Hoh river, and I must say that it's something you have to see in person.  The forest is incredibly lush and green, but it's often hard to take a good picture because it seems like something is always in the way of your shot, and it was difficult to find wildlife because there were way too many hiding spots.  To make matters worse, I grossly underestimated how cold it got there at night, and overestimated how waterproof my tent would be.  I'll have to try to make the 18 mile hike up to the glacier meadows some other time!  The park as a whole is wonderfully maintained, and the rangers and park staff were pleasant and helpful--just make sure you come prepared for the moisture, cold nights, and permethrin-resistant mosquitoes.

A columbine flower, Aquilegia formosa

Scaphinotus angusticollis looking for a meal on a mossy trunk.
The elk were quite unafraid of people.
The trees frequently had hollow openings under them.  This effect occurs because the only sunlit real estate on the forest floor is where another tree has recently fallen!  This tree began its life growing on the trunk of a fallen predecessor, and now that progenitor has rotted away.  

Monday, July 2, 2012

What Is the Most Dangerous Drug?

The recent Miami cannibal attack brought "bath salts" to the attention of the general public.  Bath salts are not the Epsom salts that sit unused in bathrooms nationwide, but are actually designer drugs that are sold as bath salts in order to circumvent regulation of the component chemical (typically MDPV) and avoid legal responsibility for the results of human consumption.  These designer drugs are frequently legal to possess or consume because they are not themselves controlled substances; they are instead chemically similar to controlled substances, on the theory that similar chemicals should have similar recreational effects.

While reading about these so-called "research chemicals" at Erowid, a project that purports to provide honest information about drug effects and dosage for users and health care providers, I became interested in the stories of user experiences with these poorly-understood substances.  Assessing the similarities and differences between user accounts, it appeared that despite the small sample size there were indeed typical presentations of the effects of each drug.  If the thematic saturation from just a dozen or so accounts was enough to assemble a list of common experiences and side effects, then, given the enormous number of user accounts at Erowid, could I draw some conclusions about the relative danger of drugs in spite of the obvious self-selection bias?  A few notes before I start:
  • I will discard all drugs with fewer than 100 total user accounts from my study.  If you would like to repeat this study from the R code with a different threshold, source is provided.
  • The reports at Erowid are self-selected.  For example, you will see in Table 2 that alcohol use is less reported than cocaine.  Obviously, a smaller proportion of alcohol users see fit to report their experiences than cocaine users since most adults drink alcohol.  I cannot account for this self-selection bias, and it certainly taints all of my results.
  • Data were collected from all user accounts at Erowid on June 27, 2012.
  • I have made the R code for this web scraping experiment available here.  It includes web scraping code, an example of elastic net models in R with the glmnet package, labeled scatterplots, and how to produce normalized word cloud data.
  • The dose makes the poison, and I haven't done anything to account for the size of the dose nor the method of drug delivery.  Each case in Erowid should be considered a recreational or abusive dose.  
  • I in no way account for "fatally bad experiences" versus just "didn't enjoy it."  You will learn below that the risk for caffeine is quite high because the users take absurdly high amounts of it and are merely uncomfortable during the experience.  There are other drugs where a high dose will do far worse than make you uncomfortable.
  • There is no intrinsically safe substance.  None of these results should be construed as an endorsement of any illegal drug use.
First, I used Erowid's search feature to retrieve all user experiences, the drugs that were taken during the session, and the category of report.  I assembled these into large binary matrices, where the category of the report is the response variable and the drugs taken were the features.  I did not discriminate between large and small doses, and only counted whether or not a drug was taken at all during an experience.
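For readers who want to see the shape of these matrices, here is a minimal base R sketch on toy data.  The three reports below are made up, and the variable names are chosen only to match the model call later in the post; the real scraping code is in the script linked above.

```r
# Hypothetical input: each report lists the drugs taken and its Erowid category.
reports <- list(
  list(drugs = c("Cannabis", "Alcohol"), category = "Glowing Experiences"),
  list(drugs = c("Alcohol"),             category = "Health Problems"),
  list(drugs = c("Cannabis", "LSD"),     category = "Mystical Experiences")
)

allDrugs <- sort(unique(unlist(lapply(reports, `[[`, "drugs"))))

# One row per report, one column per drug; 1 means the drug was taken at all,
# regardless of dose.
featureTable <- t(sapply(reports, function(r) as.integer(allDrugs %in% r$drugs)))
colnames(featureTable) <- allDrugs

# Binary response: 1 if the report falls in one of the "Reward" categories.
rewardCategories <- c("Glowing Experiences", "Mystical Experiences", "Health Benefits")
rewardTable <- as.integer(sapply(reports, function(r) r$category %in% rewardCategories))

# Keep only drugs with enough reports (the study uses 100; 2 here for toy data).
minReports <- 2
featureTable <- featureTable[, colSums(featureTable) >= minReports, drop = FALSE]

print(featureTable)
print(rewardTable)
```

Changing the threshold is a one-line edit to minReports, which is how I would repeat the study with a different cutoff.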

Next, I decided to group the response variables a bit more coarsely. In my study, I will refer to number of reports, reward, risk, and addictiveness.

Table 1:  Categories of experiences in my study versus Erowid.

My Study | Erowid Category
Number of Reports | Total number of data points in all categories
Reward | "Glowing Experiences", "Mystical Experiences", "Health Benefits"
Physical Risk | "Health Problems", "Train Wrecks & Trip Disasters", "Addiction & Habituation"
Addictive | "Addiction & Habituation"

Finally, I will build an elastic net model to determine which drugs are most associated with each category.  Because the majority of experience reports at Erowid involve a cocktail of drugs, it is necessary to use a predictive model that will be able to give me some idea as to the risks of individual elements when several are taken together.  In R, the code looks like this:

library(glmnet)  # provides glmnet()
rewardModelGLMNET <- glmnet(featureTable, rewardTable, family="binomial")

Fitting one such model per response variable is all that is required to obtain coefficients for each of reward, risk and addictiveness.  I present the coefficients from a model using all of the drugs with at least 100 experiences in Table 2.
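A slightly fuller sketch of fitting and reading off coefficients is below.  It runs on simulated stand-in data rather than the Erowid matrices, and the choices of alpha = 0.5 and of cross-validation to pick the penalty strength are assumptions for illustration; I am not claiming they match the settings behind Table 2.  Note that glmnet's default, alpha = 1, is the pure lasso penalty, so an elastic net mix has to be requested explicitly.

```r
library(glmnet)

set.seed(42)
# Toy stand-ins for featureTable and rewardTable (hypothetical data).
nReports <- 200
featureTable <- matrix(rbinom(nReports * 5, 1, 0.3), nrow = nReports,
                       dimnames = list(NULL, paste0("drug", 1:5)))
rewardTable <- rbinom(nReports, 1, plogis(-0.5 + 1.5 * featureTable[, 1]))

# alpha = 0.5 mixes the ridge and lasso penalties; cv.glmnet chooses the
# penalty strength lambda by cross-validation.
cvFit <- cv.glmnet(featureTable, rewardTable, family = "binomial", alpha = 0.5)

# Coefficients at the cross-validated lambda, intercept dropped, sorted from
# most to least associated with the response.
coefs <- as.matrix(coef(cvFit, s = "lambda.min"))[-1, 1]
print(sort(coefs, decreasing = TRUE))
```

Repeating the same call with the risk and addiction responses in place of rewardTable gives the other two columns of coefficients.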

Table 2:  Reward, Risk, Addictiveness and number of reports of drugs from Erowid.  Unitless model coefficients are given, which gauge the relative reward, risk or addictive nature compared to other drugs on the list.  Note the obvious self-selection bias visible in the list of reported drugs, where cocaine and opiates are ranked more highly than the relatively ubiquitous alcohol. 

Most Rewarding
DMT 0.94
Ayahuasca 0.89
Cacti 0.83
2C-I 0.79
Mushrooms 0.76
Pharms - Paroxetine 0.72
MDMA 0.72
Nitrous Oxide 0.62
5-MeO-DMT 0.52
Morning Glory 0.48
Methylone 0.45
H.B. Woodrose 0.43
DPT 0.4
Hydrocodone 0.39
AMT 0.36
LSD 0.32
Salvia divinorum 0.29
Mimosa spp. 0.28
Kava 0.27
Kratom 0.26
2C-E 0.25
2C-T-2 0.23
Harmala Alkaloids 0.22
Melatonin 0.22
Ketamine 0.21
2C-B 0.16
Opioids 0.13
Pharms - Buprenorphine 0.12
Cannabis 0.11
Pharms - Venlafaxine 0.08
Modafinil 0
Pharms - Alprazolam -0.02
Pharms - Tramadol -0.02
2C-T-7 -0.05
Heroin -0.05
Anadenanthera spp. -0.07
Pharmaceuticals -0.07
5-MeO-DiPT -0.12
Pharms - Clonazepam -0.12
Amphetamines -0.12
Tobacco -0.13
Pharms - Oxycodone -0.24
Opiates -0.27
Amanitas -0.31
Alcohol -0.33
Benzodiazepines -0.33
Syrian Rue -0.34
Pharms - Methylphenidate -0.41
Codeine -0.42
Cannabinoid Receptor Agonists -0.43
Methamphetamine -0.51
Pharms - Bupropion -0.54
Absinthe -0.57
GHB -0.6
DXM -0.6
Spice and Synthetic Cannabinoids -0.78
Nutmeg -0.82
Pharms - Zolpidem -0.82
Cocaine -0.84
Datura -0.95
Caffeine (extreme dose) -0.98
5-MeO-AMT -1
Inhalants -1.1
Piperazines -1.1
SSRIs -1.12
Diphenhydramine -2.01
Dimenhydrinate -2.44
Least Rewarding

Most Risky
GHB 1.62
Methamphetamine 1.51
Heroin 1.35
Cocaine 1.16
Pharms - Venlafaxine 1.15
Inhalants 1.09
Opioids 0.93
Tobacco 0.91
Caffeine (extreme dose) 0.9
Syrian Rue 0.85
Pharms - Buprenorphine 0.82
Amphetamines 0.76
Pharms - Tramadol 0.73
Pharms - Paroxetine 0.69
Pharmaceuticals 0.68
DXM 0.64
Alcohol 0.48
Datura 0.47
Pharms - Zolpidem 0.45
Dimenhydrinate 0.42
Pharms - Methylphenidate 0.4
Diphenhydramine 0.38
5-MeO-AMT 0.32
Benzodiazepines 0.25
Nitrous Oxide 0.2
Opiates 0.19
AMT 0.18
MDMA 0.16
Pharms - Bupropion 0.13
SSRIs 0.1
2C-T-7 0.05
Piperazines 0.02
Kratom -0.01
Modafinil -0.08
Pharms - Oxycodone -0.11
Pharms - Alprazolam -0.15
Spice and Synthetic Cannabinoids -0.16
Cannabis -0.19
Cannabinoid Receptor Agonists -0.2
Ketamine -0.22
LSD -0.29
Codeine -0.3
Morning Glory -0.32
Amanitas -0.39
Pharms - Clonazepam -0.39
5-MeO-DMT -0.42
Nutmeg -0.5
Hydrocodone -0.5
DPT -0.53
5-MeO-DiPT -0.63
H.B. Woodrose -0.65
2C-T-2 -0.7
Harmala Alkaloids -0.73
Melatonin -0.76
2C-E -0.78
Methylone -0.83
2C-B -0.85
Kava -0.92
Ayahuasca -0.93
2C-I -0.94
Mushrooms -1.02
Mimosa spp. -1.09
Anadenanthera spp. -1.27
Absinthe -1.31
DMT -2.14
Cacti -2.17
Salvia divinorum -2.19
Least Risky

Most Addictive
Methamphetamine 2.21
Cocaine 2.05
Opioids 1.53
Pharms - Buprenorphine 1.46
Heroin 1.43
Pharms - Venlafaxine 1.38
Amphetamines 1.33
Tobacco 1.31
Pharms - Paroxetine 1.26
Pharms - Tramadol 1.2
GHB 1.08
Pharms - Methylphenidate 0.93
Opiates 0.67
Benzodiazepines 0.67
Kratom 0.65
Pharms - Clonazepam 0.65
Inhalants 0.55
Caffeine (extreme dose) 0.52
Nitrous Oxide 0.39
Pharms - Alprazolam 0.3
Modafinil 0.07
DXM 0.03
Pharms - Zolpidem -0.03
Ketamine -0.08
Pharms - Bupropion -0.1
Alcohol -0.15
Pharms - Oxycodone -0.16
MDMA -0.16
Dimenhydrinate -0.22
Diphenhydramine -0.26
Cannabinoid Receptor Agonists -0.3
Pharmaceuticals -0.31
Codeine -0.45
Harmala Alkaloids -0.51
Methylone -0.52
Cannabis -0.67
Melatonin -0.81
Absinthe -0.83
Hydrocodone -0.83
5-MeO-AMT -0.88
SSRIs -1.04
2C-T-2 -1.23
Piperazines -1.24
Spice and Synthetic Cannabinoids -1.28
5-MeO-DMT -1.54
Kava -1.64
AMT -1.7
2C-E -1.83
Datura -1.85
LSD -1.88
2C-T-7 -2.16
5-MeO-DiPT -2.31
DPT -2.32
Mushrooms -2.86
Mimosa spp. -3.35
Syrian Rue -3.81
Salvia divinorum -3.87
Ayahuasca -4.88
Anadenanthera spp. -5.03
Amanitas -5.38
Cacti -5.43
DMT -5.49
H.B. Woodrose -5.6
Nutmeg -5.7
2C-B -5.7
Morning Glory -5.82
2C-I -5.96
Least Addictive

Most Reported Experiences
Mushrooms 1839
Cannabis 1637
Salvia divinorum 1577
MDMA 1354
LSD 1232
DXM 730
Opiates 589
Pharmaceuticals 582
Opioids 544
Cocaine 529
Alcohol 491
Morning Glory 463
2C-I 458
Amphetamines 425
Harmala Alkaloids 414
Methamphetamine 353
5-MeO-DiPT 318
DMT 315
Ketamine 314
5-MeO-DMT 310
SSRIs 305
H.B. Woodrose 304
Syrian Rue 301
2C-E 292
Datura 291
Nutmeg 289
AMT 274
Nitrous Oxide 258
Cacti 256
Diphenhydramine 248
Benzodiazepines 236
2C-T-7 235
Pharms - Tramadol 232
2C-B 229
Caffeine (extreme dose) 213
Kratom 209
Dimenhydrinate 207
Pharms - Oxycodone 205
Amanitas 194
Hydrocodone 194
Heroin 191
DPT 185
Inhalants 185
Ayahuasca 179
GHB 172
Kava 172
Pharms - Zolpidem 165
2C-T-2 164
Codeine 161
Pharms - Alprazolam 150
5-MeO-AMT 138
Pharms - Methylphenidate 138
Mimosa spp. 136
Tobacco 132
Pharms - Bupropion 128
Pharms - Paroxetine 124
Pharms - Venlafaxine 124
Modafinil 111
Methylone 109
Absinthe 108
Anadenanthera spp. 107
Melatonin 105
Spice and Synthetic Cannabinoids 104
Pharms - Buprenorphine 101
Piperazines 98
Pharms - Clonazepam 96
Cannabinoid Receptor Agonists 90
Least Reported Experiences

Table 2 lists the most rewarding, risky, addictive and most reported common drugs on Erowid.  I found it interesting that the most rewarding and least addictive drugs tended to be hallucinogens and entheogens, and that the most stereotypically abused drugs (cocaine, heroin, methamphetamine) had higher risk and addiction coefficients but generally lower rates of rewarding experiences.  This trend makes me think of the Rat Park Experiment, where it was shown that rats that lived happy lives did not become addicts despite easy availability of drugs in their environment.  This is consistent with the idea that methamphetamine, cocaine, and heroin may indeed be more of an avenue of escape than a positive experience in their own right.  The bath salts which I wondered about from the cannibal attack in the news typically contain MDPV, a substance which did not have enough documented experiences on Erowid to pass my inclusion criterion of at least 100 user reports.

Figure 1:  Risk versus Reward in common drugs on Erowid as calculated by an elastic net model.  Please note that this plot is greatly affected by self-selection bias.  For example, normal users of caffeine think nothing of it, but do not report their experiences.   
Most remarkably, Figure 1 shows that hallucinogens and entheogens tended to fall in the "more rewarding, less risky" region, and that drugs categorized as research chemicals tended to offer a better reward-to-risk ratio than alcohol, tobacco, cocaine or heroin.
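A labeled scatterplot of this kind takes only a few lines of base R.  The sketch below uses a handful of coefficients copied from Table 2; the choice of which drugs to show, the axis labels and the dashed quadrant guides are mine for illustration and do not reproduce Figure 1 exactly.

```r
# A handful of (reward, risk) coefficients copied from Table 2.
drugs <- data.frame(
  name   = c("DMT", "Mushrooms", "LSD", "Cannabis",
             "Alcohol", "Cocaine", "Heroin", "GHB"),
  reward = c(0.94, 0.76, 0.32, 0.11, -0.33, -0.84, -0.05, -0.60),
  risk   = c(-2.14, -1.02, -0.29, -0.19, 0.48, 1.16, 1.35, 1.62)
)

# Draw empty axes first, then place each drug's name at its coordinates.
plot(drugs$reward, drugs$risk, type = "n",
     xlab = "Reward coefficient", ylab = "Risk coefficient",
     main = "Risk versus reward (subset of Table 2)")
abline(h = 0, v = 0, lty = 2)  # quadrant guides
text(drugs$reward, drugs$risk, labels = drugs$name)
```

In this subset, every drug with a positive reward coefficient lands below the horizontal guide, which is the pattern described above.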

Overall, I was a bit disappointed with the results of this study.  Not shown are numerous plots and statistics that I thought would be enlightening, but in the end were worthless due to the corruption of the self-selection bias.  For example, I was sure there would be a correlation between number of reports and reward coefficient, but in fact I could find nothing significant.  It's possible that availability is a bigger factor in selecting a drug than reward, or that users of some drugs are simply more inclined to share their story.  Getting better data on the effects of illegal drugs would require a survey designed to avoid self-selection bias.

In conclusion,
  • This study was enormously biased.  The popularity section of Table 2 shows a colossal self-selection bias because cocaine is reported far more than alcohol.  I accounted for neither the dose nor the type of risk, which shows in the fact that high doses of caffeine were more associated with risk than alcohol.
  • There was not enough data at Erowid to satisfactorily assess the outcomes associated with MDPV, a principal component of many "bath salts".  The R code for this study is provided in case you would like to check for yourself with another threshold.
  • The user experiences at Erowid indicate that the most rewarding and least risky drugs tend to be hallucinogens and entheogens.
  • The user experiences at Erowid indicate that the most dangerous drug is GHB, followed by Methamphetamine, Heroin and Cocaine.
  • I do not partake in, endorse or recommend the use of any illegal substance.  There is no intrinsically safe substance.      

As a bonus, I added some code to the R script to produce word cloud data for use with programs such as the freeware program Wordaizer.  To produce these word clouds, I first counted all of the words in a random sample of Erowid experiences, then counted the words in the experiences of just the drug of interest.  By dividing the frequency of each word in the drug-specific set by the frequency of each word in the control sample, I was able to get an idea of which keywords were truly more common and unique to each drug as opposed to Erowid experiences in general.  I then used Wordaizer under Wine in order to produce these attractive word clouds.  The code I used to produce these is freely available.
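The normalization step can be sketched in base R as follows.  This is a simplified illustration on made-up sentences, not the script itself:  the function names, the tokenization rule and the handling of words missing from the control sample are all assumptions.

```r
# Relative frequency of each word in a set of texts.
wordFrequencies <- function(texts) {
  words <- unlist(strsplit(tolower(paste(texts, collapse = " ")), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  counts <- table(words)
  setNames(as.numeric(counts) / length(words), names(counts))
}

# Ratio of drug-specific frequency to control frequency; larger values mark
# words that are more distinctive of the drug.
distinctiveness <- function(drugTexts, controlTexts) {
  drugFreq <- wordFrequencies(drugTexts)
  controlFreq <- wordFrequencies(controlTexts)
  matched <- controlFreq[names(drugFreq)]
  # Assumption: words absent from the control sample get the smallest control
  # frequency, which avoids dividing by zero.
  matched[is.na(matched)] <- min(controlFreq)
  sort(drugFreq / matched, decreasing = TRUE)
}

# Hypothetical toy corpora standing in for scraped experience reports.
controlTexts <- c("i felt calm and then i slept",
                  "the music felt warm and i laughed")
drugTexts    <- c("the colors and music felt alive",
                  "colors everywhere and i laughed")

ratio <- distinctiveness(drugTexts, controlTexts)
print(ratio)
```

The resulting named vector of weights is the sort of word-and-weight list that a word cloud program can take as input.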

Figure 2:  Word clouds produced using Wordaizer with Erowid experiences for a selection of drugs, normalized and prepared with R code available above.

Thanks to Simiao for proofreading and advice regarding the design of this study.