Ben Li-Sauerwine's Notebook: July 2012

Sunday, July 22, 2012

Washington State Primary Ballot: Who Are They?

I had a new experience yesterday: I got my primary election ballot in the mail! Coming from Pennsylvania and having registered non-partisan, I was not allowed to vote in Pennsylvania primaries. I always thought that was odd: if you're a political party, wouldn't you care about what the people with no party affiliation thought about your candidates as much as or more than those in your party? Since people in your party are not likely to change sides, the independents are the people most likely to be swayed to your side in an election!

Anyhow, I was dismayed when I saw the primary ballot, for two reasons. First, the ballot included such names as "Mike the Mover" and "Goodspaceguy." To me, these non-names don't inspire much confidence in the candidates, and it turns out their websites didn't either. Second, when I started searching for candidates, I noted that many didn't even have a website, or if they did, it didn't include meaningful information about their platform. I don't want this to be a blog about anything but facts and opinion supported directly by facts, so I will keep this post simple. What follows is a table of the candidates on the Washington State Primary Election ballot and, where I could find one, the website where each makes the case for your vote in their own words. I hope you will use this information to make the best-informed decision in the upcoming primary and election, and may the best candidate win! Please also be aware of the election pamphlet available at the King County website, which I found helpful but incomplete.

For United States Senator
Name	Party	Why you should vote for
Michael Baumgartner	Republican	votebaumgartner.com
Will Baker	Reform	No Website Found
Chuck Jackson	Republican	scaryreality.com
Timmy (Doc) Wilson	Democratic	timwilsonforsenate.org
Art Coday	Republican	artcoday.com
Maria Cantwell	Democratic	cantwell.senate.gov
Glen (Stocky) R. Stockwell	Republican	washingtonstateeconomicdevelopment.vpweb.com
Mike the Mover	Republican	theoriginalmikethemover.com

For United States Representative
Name	Party	Why you should vote for
Don Rivers	Democratic	donriversforcongress.com
Goodspaceguy	Employmentwealth	colonizespace.blogspot.com
Scott Sutherland	GOP	vote-wa.org
Andrew Hughes	Democratic	andrewhughesforcongress.com
Jim McDermott	Democratic	mcdermottforcongress.com
Ron Bemis	Republican	ronbemisforcongress.org
Charles Allen	Democratic	charlesallen2012.com

For Washington State Governor
Name	Party	Why you should vote for
Rob Hill	Democratic	No Website Found
Rob McKenna	Republican	robmckenna.org
Jay Inslee	Democratic	jayinslee.com
James White	Independent	whiteforgovernor2012.com
Christian Joubert	No Party Preference	holisticgovernor.com
Shahram Hadian	Republican	hadian2012.com
L. Dale Sorgen	Independent	imagineliberty.us
Max Sampson	Republican	No Website Found
Javier O. Lopez	Republican	Website Down

For Washington State Lieutenant Governor
Name	Party	Why you should vote for
Glenn Anderson	Republican	glennanderson2012.org
Brad Owen	Democrat	bradowen2012.com
James Robert Deal	No Party Preference	fluoride-class-action.com
Bill Finkbeiner	Republican	billfinkbeiner.org
Dave T. Sumner, IV	Neopopulist	No Website Found
Mark Greene	Democracy Independent	brandnewelections.us

For Washington State Secretary of State
Name	Party	Why you should vote for
Jim Kastama	Democratic	jimkastama.com
David J Anderson	No Party Preference	No Website Found
Sam Wright	Human Rights	thehumanrightsparty.org
Karen Murray	Constitution	murray4sos.org
Kathleen Drew	Democratic	kathleendrew2012.com
Kim Wyman	Republican	kimwyman.com
Greg Nickels	Democratic	gregnickels.com

For Washington State Treasurer
Name	Party	Why you should vote for
Jim McIntire	Democratic	jimmcintire.com

For Washington State Auditor
Name	Party	Why you should vote for
Troy Kelley	Democratic	troykelley.com
James Watkins	Republican	watkinsforauditor.com
Mark Miloscia	Democratic	markmiloscia.com
Craig Pridemore	Democratic	craigpridemore.com

For Washington State Attorney General
Name	Party	Why you should vote for
Bob Ferguson	Democratic	electbobferguson.com
Reagan Dunn	Republican	reagandunn.com
Stephen Pidgeon	Republican	stephenpidgeon4ag.com

For Washington State Commissioner of Public Lands
Name	Party	Why you should vote for
Stephen A Sharon	No Party Preference	No Website Found
Peter J Goldmark	Democratic	petergoldmark.com
Clint Didier	Republican	clintdidier.org

For Washington State Superintendent of Public Instruction
Name	Why you should vote for
James Bauckman	jamesbauckmanforspi.wordpress.com
Randy I Dorn	randydorn2012.com
Don Hansler	No Website Found
John Patterson Blair	johnblairportal.wordpress.com
Ronald L (Ron) Higgins	www.higgins-spi-2012.com

For Washington State Insurance Commissioner
Name	Party	Why you should vote for
John R Adams	Republican	infojohnadams.com
Mike Kreidler	Democratic	mikekreidler.com
Scott Reilly	Republican	scott-reilly.org
Brian C Berend	Independent	No Website Found

Tuesday, July 17, 2012

Sage/DREAM Breast Cancer Challenge Launch

This morning's webinar marked the public launch of the Sage/DREAM Breast Cancer Challenge. The goal is simple: given a large data set consisting of clinical covariates, copy number data, expression data, and survival data, can you produce the best predictions of survival time for new patients? The unstated goal is a little more subtle: so far, I have seen no compelling evidence that microscopic information such as gene expression or copy number data improves predictions of macroscopic outcomes such as patient survival time. We hope that this competition will encourage people to build better models of disease collaboratively, and produce some of the first evidence that gene expression and copy number data can actually be useful in predicting patient prognosis. As an example, I will produce an entry that you can feel free to cannibalize, and will write more about it as I evolve it into a better model of disease. As a Sage employee, I am not eligible to win, so I hope you will modify and adapt this code and produce your own entry, superior to my own!

Before trying my code, you will want to check out the Breast Cancer Competition Getting Started Guide.

From an internal competition that preceded this launch, we learned that the R RandomSurvivalForest package produced the best models of survival given the data, and that the most critical part of the model was which data is chosen for inclusion and how it is pre-processed before being handed over to RandomSurvivalForest.

First, I will produce the core of my submission, the model class file. I will take only the clinical covariates and totally ignore the copy number and expression data, though I will leave a space where they can be included later. In fact, you will find that the function that transforms the data before modeling is the most important part of the entire competition: victory will hinge not on the best model, but on cleaning up the data in the best way possible.

View my Model Class File on github.

Next, I will train and submit the model. This code is relatively straightforward, but if you are going to modify my model for your own purposes, you will want to make sure that you submit to the public leaderboard instead of the Sage leaderboard!

View my Submission File on github.

As you can see, my Random Survival Forest Clinical-Only model did pretty well! On the Sage employees leaderboard, model "Sauerwine RSFModel test 3" got a respectable test score of 0.71. In my next post, we'll see how to improve that further. At the time of this post, it actually does better than any model on the Sage/DREAM Public Leaderboard, but since you're free to take my model and improve on it, that's not likely to be the case for long!

Good luck, and happy modeling!

Sunday, July 15, 2012

Keep it Simple!

I've updated my CV to reflect some success I've had lately in TopCoder Marathon Matches. I competed in the Harvard Medical School #3 Competition, where I did dismally, and the NASA Tournament Lab "SynchronousControl" Weekend Competition, where I got a 4th place award.

I learned a lot from these contests, so I'll comment a bit on how they went and how I could have improved.

In the HMS #3 Competition, users were asked to examine simulated gene marker time series data to determine which genes were being most strongly selected for and against. This boiled down to doing a regression on some data points from a Markov chain, and trying to determine what the parameters of the Markov chain were. In principle, it was pretty straightforward. Simply returning "parameter = (x1-1)/x0" did pretty well--better than my solution! I had tried several different and more advanced regressions, as well as methods to account for uncertainty on the data points. Ultimately, all of this complex math did slightly worse than the obvious solution stated above.

The question itself had some problems, in my mind. First, the organizers assumed that the reading at any given time point was correct and free of noise for the purposes of the next iteration of the Markov chain. Second, the nature of the problem was such that you could not nominate a gene for being both a "most up-regulated" and "most down-regulated" gene. Sometimes, the noise turned out to be so extreme that my best guess would have been to nominate a gene for both categories! Finally, the simulated data was modified so that no gene would ever have a reading of zero, which is not realistic given the nature of the experiment. So, I'm not entirely certain that the researchers who ran the contest are ultimately going to find that the solution is what they actually wanted in the first place.

What I should have done, in retrospect, is the following: First, I should have immediately dissected and worked backwards from the data simulation program. The problem description was not exactly representative of the way the data was being simulated, and understanding this earlier certainly would have affected my progress. Simulating a lot of my own data for my own test harness also would have been instrumental. Second, I found that very complex methods are actually not all that much better than very simple methods. I should have taken all of the simple models I tried and compared them in parallel to simulated data and understood the domains in which each model worked the best, then attempted to identify the domain that each data set was in and used the best simple model per task.

So, I say that I did "dismally" because the extremely complex model I settled on was embarrassingly worse than the simplest possible model.

To be fair, the NTL competition that I got a 4th place award in didn't go a lot better. The problem was to determine what moves to make to help a bunch of robots escape from a maze, under the condition that all of the robots must make the same set of moves! The principal challenge was that the program to find these moves had to complete in under 2 seconds, which left time for just one A* or BFS in the largest possible map. I spent way too much time working on a solution that I had to discard because it performed more searches than could be conducted in 2 seconds.

If I had realized this earlier, I would have focused on a heuristic to help robots cooperatively reach their goals instead of focusing on ways to greedily get each robot out individually in the best time.

The algorithm I ultimately used was this: "Identify the closest robot to any goal. Perform the minimum number of moves to get that robot out." This crude method was sufficient to get a 4th place award in my room, $50. What I would have done if I had to do it again would be this: "Identify the closest robot to any goal. Perform the move that minimizes the average distance between all robots and any goal, subject to the constraint that the closest robot must either move closer or stay the same distance away."
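The revised heuristic can be sketched in a few lines of R. This is only an illustration under assumptions that are not from the contest: the grid encoding ("#" wall, "." open, "G" goal), the rule that a blocked robot stays in place, and the function names goal_distance, move_robots and choose_move are all hypothetical. A single multi-source BFS from the goals gives every cell its distance to the nearest goal, so each candidate move costs only a table lookup.

```r
# Multi-source BFS: steps from every open cell to its nearest goal.
# grid is a character matrix: "#" wall, "." open, "G" goal (assumed encoding).
goal_distance <- function(grid) {
  dist <- matrix(Inf, nrow(grid), ncol(grid))
  queue <- which(grid == "G", arr.ind = TRUE)
  dist[queue] <- 0
  steps <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  head <- 1
  while (head <= nrow(queue)) {
    cell <- queue[head, ]
    head <- head + 1
    for (k in 1:4) {
      nxt <- cell + steps[k, ]
      if (nxt[1] < 1 || nxt[1] > nrow(grid) || nxt[2] < 1 || nxt[2] > ncol(grid)) next
      if (grid[nxt[1], nxt[2]] == "#" || is.finite(dist[nxt[1], nxt[2]])) next
      dist[nxt[1], nxt[2]] <- dist[cell[1], cell[2]] + 1
      queue <- rbind(queue, nxt)
    }
  }
  dist
}

# Apply one shared move; a robot blocked by a wall or the edge stays put (assumed rule).
move_robots <- function(grid, robots, step) {
  for (i in seq_len(nrow(robots))) {
    nxt <- robots[i, ] + step
    inside <- nxt[1] >= 1 && nxt[1] <= nrow(grid) && nxt[2] >= 1 && nxt[2] <= ncol(grid)
    if (inside && grid[nxt[1], nxt[2]] != "#") robots[i, ] <- nxt
  }
  robots
}

# Pick the move that minimizes the mean distance to a goal, subject to the
# closest robot not ending up farther away than it started.
choose_move <- function(grid, robots, dist) {
  moves <- list(up = c(-1, 0), down = c(1, 0), left = c(0, -1), right = c(0, 1))
  before <- dist[robots]
  lead <- which.min(before)
  best <- NULL
  best_mean <- Inf
  for (name in names(moves)) {
    after <- dist[move_robots(grid, robots, moves[[name]])]
    if (after[lead] > before[lead]) next
    if (mean(after) < best_mean) {
      best_mean <- mean(after)
      best <- name
    }
  }
  best
}

rows <- c("#######", "#G....#", "#.##..#", "#.....#", "#######")
grid <- do.call(rbind, strsplit(rows, ""))
robots <- rbind(c(2, 4), c(4, 6))   # (row, col) of each robot still in the maze
dist <- goal_distance(grid)
print(choose_move(grid, robots, dist))   # "left"
```

In this toy maze, "up" would also help the far robot, but "left" moves both robots closer and so wins on average distance. A real entry would also need to remove robots once they reach a goal and handle ties.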

The overall lesson was this: keep it simple. I often find myself gravitating towards some complex custom algorithm that either grossly overfits, as in the HMS competition, or spends way too much processor time grinding towards a solution, as in the NTL competition, when a crude heuristic could have approximated a better solution much more quickly. When working on these open-ended problems, I need to fight the urge to write a complex algorithm, stay focused on the goal, and home in on the best solution using increasingly better heuristics.

Sunday, July 8, 2012

Vacation: Olympic National Rainforest

For her birthday, I flew my friend Lilli over to Seattle to explore the Olympic National Rainforest. I'd seen pictures of the idyllic mosses and glaciers near the Hoh River, and I must say that it's something you have to see in person. The forest is incredibly lush and green, but it's often hard to take a good picture because something always seems to be in the way of your shot, and it was difficult to find wildlife because there were way too many hiding spots. To make matters worse, I grossly underestimated how cold it got there at night, and overestimated how waterproof my tent would be. I'll have to try to make the 18-mile hike up to the glacier meadows some other time! The park as a whole is wonderfully maintained, and the rangers and park staff were pleasant and helpful--just make sure you come prepared for the moisture, cold nights, and permethrin-resistant mosquitoes.

A columbine flower, Aquilegia formosa

Scaphinotus angusticollis looking for a meal on a mossy trunk.

The elk were quite unafraid of people.

The trees frequently had hollow openings under them. This effect occurs because the only sunlit real estate on the forest floor is where another tree has recently fallen! This tree began its life growing on the trunk of a fallen predecessor, and now that progenitor has rotted away.

Monday, July 2, 2012

What Is the Most Dangerous Drug?

The recent Miami cannibal attack brought "bath salts" to the attention of the general public. Bath salts are not the Epsom salts that sit unused in bathrooms nationwide, but are actually designer drugs that are sold as bath salts in order to circumvent regulation of the component chemical (typically MDPV) and avoid legal responsibility for the results of human consumption. These designer drugs are frequently legal to possess or consume because they are not themselves controlled substances; they are merely chemically similar to controlled substances, on the theory that similar chemicals should have similar recreational effects.

While reading about these so-called "research chemicals" at Erowid, a project that purports to provide honest information about drug effects and dosage for users and health care providers, I became interested in the stories of user experiences with these poorly-understood substances. Assessing the similarities and differences between user accounts, it appeared that despite the small sample size there were indeed typical presentations of the effects of each drug. If the thematic saturation from just a dozen or so accounts were enough to assemble a list of common experiences and side effects, then given the enormous number of user accounts at Erowid, could I draw some conclusions about the relative danger of drugs in spite of the obvious self-selection bias? A few notes before I start:

I will discard all drugs with fewer than 100 total user accounts from my study. If you would like to repeat this study from the R code with a different threshold, source is provided.
The reports at Erowid are self-selected. For example, you will see in Table 2 that alcohol use is less reported than cocaine. Obviously, a smaller proportion of alcohol users see fit to report their experiences than cocaine users since most adults drink alcohol. I cannot account for this self-selection bias, and it certainly taints all of my results.
Data was collected from all user accounts at Erowid on June 27, 2012.
I have made the R code for this web scraping experiment available here. It includes web scraping code, an example of elastic net models in R with the glmnet package, labeled scatterplots, and how to produce normalized word cloud data.
The dose makes the poison, and I haven't done anything to account for the size of the dose nor the method of drug delivery. Each case in Erowid should be considered a recreational or abusive dose.
I do not distinguish between "fatally bad experiences" and just "didn't enjoy it." You will learn below that the risk for caffeine is quite high because the users take absurdly high amounts of it and are merely uncomfortable during the experience. There are other drugs where a high dose will do far worse than make you uncomfortable.
There is no intrinsically safe substance. None of these results should be construed as an endorsement of any illegal drug use.

First, I used Erowid's search feature to retrieve all user experiences, the drugs that were taken during the session, and the category of report. I assembled these into large binary matrices, where the category of the report is the response variable and the drugs taken were the features. I did not discriminate between large and small doses, and only counted whether or not a drug was taken at all during an experience.
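As a rough sketch of what such a binary feature matrix looks like, the snippet below builds one from a handful of made-up reports. The reports list and the threshold of 2 are placeholders chosen for illustration; the real study scraped Erowid and used a threshold of 100 reports.

```r
# Hypothetical scraped records: one entry per report, listing the drugs taken.
reports <- list(
  c("LSD", "Cannabis"),
  c("Alcohol"),
  c("Cannabis", "Alcohol"),
  c("LSD", "Nutmeg")
)

# Columns are the distinct drugs; a cell is 1 if the drug appears in the report.
drugs <- sort(unique(unlist(reports)))
featureTable <- t(vapply(reports,
                         function(taken) as.numeric(drugs %in% taken),
                         numeric(length(drugs))))
colnames(featureTable) <- drugs

# Drop rarely reported drugs (placeholder threshold; the study used 100).
minReports <- 2
featureTable <- featureTable[, colSums(featureTable) >= minReports, drop = FALSE]
print(featureTable)
```

Dose and route of administration are deliberately ignored here, matching the description above: a drug is either present in a report or it is not.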

Next, I decided to group the response variables a bit more coarsely. In my study, I will refer to number of reports, reward, risk, and addictiveness.

Table 1: Categories of experiences in my study versus Erowid.

My Study	Erowid Category
Number of Reports	Total number of data points in all categories
Reward	"Glowing Experiences", "Mystical Experiences", "Health Benefits"
Physical Risk	"Health Problems", "Train Wrecks & Trip Disasters", "Addiction & Habituation"
Addictive	"Addiction & Habituation"
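The grouping in Table 1 amounts to turning each report's category into three 0/1 response vectors. A minimal sketch, assuming each report carries a single category label (the reportCategory vector and its "Some Other Category" entry are invented for illustration):

```r
# Erowid categories grouped as in Table 1.
groups <- list(
  reward    = c("Glowing Experiences", "Mystical Experiences", "Health Benefits"),
  risk      = c("Health Problems", "Train Wrecks & Trip Disasters",
                "Addiction & Habituation"),
  addictive = c("Addiction & Habituation")
)

# Hypothetical category label for each scraped report.
reportCategory <- c("Glowing Experiences", "Addiction & Habituation",
                    "Health Problems", "Some Other Category")

# One 0/1 response vector per group: 1 if the report's category is in the group.
responses <- lapply(groups, function(cats) as.integer(reportCategory %in% cats))
str(responses)
```

Note that an "Addiction & Habituation" report counts toward both risk and addictiveness, so the two responses are not independent.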

Finally, I will build an elastic net model to determine which drugs are most associated with each category. Because the majority of experience reports at Erowid involve a cocktail of drugs, it is necessary to use a predictive model that will be able to give me some idea as to the risks of individual elements when several are taken together. In R, the code looks like this:

library(glmnet)
# featureTable: 0/1 matrix (reports x drugs); rewardTable: 0/1 vector, 1 = rewarding report
rewardModelGLMNET <- glmnet(featureTable, rewardTable, family="binomial")

A call like this one, repeated for each of reward, risk and addictiveness, is all that is required to fit the models from which the coefficients are extracted. I present the coefficients from a model using all of the drugs with at least 100 experiences in Table 2.
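For readers who want to see the coefficient extraction spelled out, here is a minimal sketch on synthetic data rather than the Erowid tables. The drug names, the sample size, the choice of alpha = 0.5 and the use of cv.glmnet to pick lambda are all assumptions for illustration; the penalty strength behind Table 2 is not stated here. Note that glmnet's alpha defaults to 1 (a pure lasso penalty), so an intermediate value is passed to mix in the ridge penalty.

```r
library(glmnet)

set.seed(1)
# Synthetic stand-in for the scraped data: 500 reports x 5 drugs (0/1).
nReports <- 500
drugs <- c("drugA", "drugB", "drugC", "drugD", "drugE")
featureTable <- matrix(rbinom(nReports * length(drugs), 1, 0.3),
                       nrow = nReports, dimnames = list(NULL, drugs))

# Simulate responses where drugA raises and drugB lowers the odds of reward.
logit <- -0.5 + 1.5 * featureTable[, "drugA"] - 1.5 * featureTable[, "drugB"]
rewardTable <- rbinom(nReports, 1, plogis(logit))

# Cross-validation picks the penalty strength lambda (an assumed choice).
cvFit <- cv.glmnet(featureTable, rewardTable, family = "binomial", alpha = 0.5)

# One coefficient per drug at the selected lambda; drop the intercept and sort.
coefs <- as.matrix(coef(cvFit, s = "lambda.min"))
rewardCoef <- sort(coefs[-1, 1], decreasing = TRUE)
print(round(rewardCoef, 2))
```

The same pattern, run once per response vector, would yield one sorted coefficient list each for reward, risk and addictiveness.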

Table 2: Reward, Risk, Addictiveness and number of reports of drugs from Erowid. Unitless model coefficients are given, which gauge the relative reward, risk or addictive nature compared to other drugs on the list. Note the obvious self-selection bias visible in the list of reported drugs, where cocaine and opiates are ranked more highly than the relatively ubiquitous alcohol.

Most Rewarding
DMT	0.94
Ayahuasca	0.89
Cacti	0.83
2C-I	0.79
Mushrooms	0.76
Pharms - Paroxetine	0.72
MDMA	0.72
Nitrous Oxide	0.62
5-MeO-DMT	0.52
Morning Glory	0.48
Methylone	0.45
H.B. Woodrose	0.43
DPT	0.4
Hydrocodone	0.39
AMT	0.36
LSD	0.32
Salvia divinorum	0.29
Mimosa spp.	0.28
Kava	0.27
Kratom	0.26
2C-E	0.25
2C-T-2	0.23
Harmala Alkaloids	0.22
Melatonin	0.22
Ketamine	0.21
2C-B	0.16
Opioids	0.13
Pharms - Buprenorphine	0.12
Cannabis	0.11
Pharms - Venlafaxine	0.08
Modafinil	0
Pharms - Alprazolam	-0.02
Pharms - Tramadol	-0.02
2C-T-7	-0.05
Heroin	-0.05
Anadenanthera spp.	-0.07
Pharmaceuticals	-0.07
5-MeO-DiPT	-0.12
Pharms - Clonazepam	-0.12
Amphetamines	-0.12
Tobacco	-0.13
Pharms - Oxycodone	-0.24
Opiates	-0.27
Amanitas	-0.31
Alcohol	-0.33
Benzodiazepines	-0.33
Syrian Rue	-0.34
Pharms - Methylphenidate	-0.41
Codeine	-0.42
Cannabinoid Receptor Agonists	-0.43
Methamphetamine	-0.51
Pharms - Bupropion	-0.54
Absinthe	-0.57
GHB	-0.6
DXM	-0.6
Spice and Synthetic Cannabinoids	-0.78
Nutmeg	-0.82
Pharms - Zolpidem	-0.82
Cocaine	-0.84
Datura	-0.95
Caffeine (extreme dose)	-0.98
5-MeO-AMT	-1
Inhalants	-1.1
Piperazines	-1.1
SSRIs	-1.12
Diphenhydramine	-2.01
Dimenhydrinate	-2.44
Least Rewarding

Most Risky
GHB	1.62
Methamphetamine	1.51
Heroin	1.35
Cocaine	1.16
Pharms - Venlafaxine	1.15
Inhalants	1.09
Opioids	0.93
Tobacco	0.91
Caffeine (extreme dose)	0.9
Syrian Rue	0.85
Pharms - Buprenorphine	0.82
Amphetamines	0.76
Pharms - Tramadol	0.73
Pharms - Paroxetine	0.69
Pharmaceuticals	0.68
DXM	0.64
Alcohol	0.48
Datura	0.47
Pharms - Zolpidem	0.45
Dimenhydrinate	0.42
Pharms - Methylphenidate	0.4
Diphenhydramine	0.38
5-MeO-AMT	0.32
Benzodiazepines	0.25
Nitrous Oxide	0.2
Opiates	0.19
AMT	0.18
MDMA	0.16
Pharms - Bupropion	0.13
SSRIs	0.1
2C-T-7	0.05
Piperazines	0.02
Kratom	-0.01
Modafinil	-0.08
Pharms - Oxycodone	-0.11
Pharms - Alprazolam	-0.15
Spice and Synthetic Cannabinoids	-0.16
Cannabis	-0.19
Cannabinoid Receptor Agonists	-0.2
Ketamine	-0.22
LSD	-0.29
Codeine	-0.3
Morning Glory	-0.32
Amanitas	-0.39
Pharms - Clonazepam	-0.39
5-MeO-DMT	-0.42
Nutmeg	-0.5
Hydrocodone	-0.5
DPT	-0.53
5-MeO-DiPT	-0.63
H.B. Woodrose	-0.65
2C-T-2	-0.7
Harmala Alkaloids	-0.73
Melatonin	-0.76
2C-E	-0.78
Methylone	-0.83
2C-B	-0.85
Kava	-0.92
Ayahuasca	-0.93
2C-I	-0.94
Mushrooms	-1.02
Mimosa spp.	-1.09
Anadenanthera spp.	-1.27
Absinthe	-1.31
DMT	-2.14
Cacti	-2.17
Salvia divinorum	-2.19
Least Risky

Most Addictive
Methamphetamine	2.21
Cocaine	2.05
Opioids	1.53
Pharms - Buprenorphine	1.46
Heroin	1.43
Pharms - Venlafaxine	1.38
Amphetamines	1.33
Tobacco	1.31
Pharms - Paroxetine	1.26
Pharms - Tramadol	1.2
GHB	1.08
Pharms - Methylphenidate	0.93
Opiates	0.67
Benzodiazepines	0.67
Kratom	0.65
Pharms - Clonazepam	0.65
Inhalants	0.55
Caffeine (extreme dose)	0.52
Nitrous Oxide	0.39
Pharms - Alprazolam	0.3
Modafinil	0.07
DXM	0.03
Pharms - Zolpidem	-0.03
Ketamine	-0.08
Pharms - Bupropion	-0.1
Alcohol	-0.15
Pharms - Oxycodone	-0.16
MDMA	-0.16
Dimenhydrinate	-0.22
Diphenhydramine	-0.26
Cannabinoid Receptor Agonists	-0.3
Pharmaceuticals	-0.31
Codeine	-0.45
Harmala Alkaloids	-0.51
Methylone	-0.52
Cannabis	-0.67
Melatonin	-0.81
Absinthe	-0.83
Hydrocodone	-0.83
5-MeO-AMT	-0.88
SSRIs	-1.04
2C-T-2	-1.23
Piperazines	-1.24
Spice and Synthetic Cannabinoids	-1.28
5-MeO-DMT	-1.54
Kava	-1.64
AMT	-1.7
2C-E	-1.83
Datura	-1.85
LSD	-1.88
2C-T-7	-2.16
5-MeO-DiPT	-2.31
DPT	-2.32
Mushrooms	-2.86
Mimosa spp.	-3.35
Syrian Rue	-3.81
Salvia divinorum	-3.87
Ayahuasca	-4.88
Anadenanthera spp.	-5.03
Amanitas	-5.38
Cacti	-5.43
DMT	-5.49
H.B. Woodrose	-5.6
Nutmeg	-5.7
2C-B	-5.7
Morning Glory	-5.82
2C-I	-5.96
Least Addictive

Most Reported Experiences
Mushrooms	1839
Cannabis	1637
Salvia divinorum	1577
MDMA	1354
LSD	1232
DXM	730
Opiates	589
Pharmaceuticals	582
Opioids	544
Cocaine	529
Alcohol	491
Morning Glory	463
2C-I	458
Amphetamines	425
Harmala Alkaloids	414
Methamphetamine	353
5-MeO-DiPT	318
DMT	315
Ketamine	314
5-MeO-DMT	310
SSRIs	305
H.B. Woodrose	304
Syrian Rue	301
2C-E	292
Datura	291
Nutmeg	289
AMT	274
Nitrous Oxide	258
Cacti	256
Diphenhydramine	248
Benzodiazepines	236
2C-T-7	235
Pharms - Tramadol	232
2C-B	229
Caffeine (extreme dose)	213
Kratom	209
Dimenhydrinate	207
Pharms - Oxycodone	205
Amanitas	194
Hydrocodone	194
Heroin	191
DPT	185
Inhalants	185
Ayahuasca	179
GHB	172
Kava	172
Pharms - Zolpidem	165
2C-T-2	164
Codeine	161
Pharms - Alprazolam	150
5-MeO-AMT	138
Pharms - Methylphenidate	138
Mimosa spp.	136
Tobacco	132
Pharms - Bupropion	128
Pharms - Paroxetine	124
Pharms - Venlafaxine	124
Modafinil	111
Methylone	109
Absinthe	108
Anadenanthera spp.	107
Melatonin	105
Spice and Synthetic Cannabinoids	104
Pharms - Buprenorphine	101
Piperazines	98
Pharms - Clonazepam	96
Cannabinoid Receptor Agonists	90
Least Reported Experiences

Table 2 gives lists of the most rewarding, risky, addictive and most reported common drugs on Erowid. I found it interesting that the most rewarding and least addictive drugs tended to be hallucinogens and entheogens, and that the most stereotypically abused drugs (cocaine, heroin, methamphetamine) had higher risk and addiction coefficients but generally lower rates of rewarding experiences. This trend makes me think of the Rat Park Experiment, where it was shown that rats that lived happy lives did not become addicts despite easy availability of drugs in their environment. This is consistent with the idea that methamphetamine, cocaine, and heroin may be more of an avenue of escape than a positive experience in their own right. The bath salts which I wondered about from the cannibal attack in the news typically contain MDPV, a substance which did not have enough documented experiences on Erowid to pass my inclusion criterion of at least 100 user reports.

Figure 1: Risk versus Reward in common drugs on Erowid as calculated by an elastic net model. Please note that this plot is greatly affected by self-selection bias. For example, normal users of caffeine think nothing of it, and so do not report their experiences.

Most remarkably, Figure 1 shows that hallucinogens and entheogens tended to fall in the "more rewarding, less risky" region of the plot, and drugs categorized as research chemicals tended to offer a better reward-to-risk ratio than alcohol, tobacco, cocaine or heroin.

Overall, I was a bit disappointed with the results of this study. Not shown are numerous plots and statistics that I thought would be enlightening, but in the end were worthless due to the corruption of the self-selection bias. For example, I was sure there would be a correlation between number of reports and reward coefficient, but in fact I could find nothing significant. It's possible that availability is a bigger factor in selecting a drug than reward, or that users of some drugs are simply more inclined to share their story. Getting better data on the effects of illegal drugs would require a survey designed to avoid self-selection bias.

In conclusion,

This study was enormously biased. The popularity section of Table 2 shows a colossal self-selection bias because cocaine is reported far more than alcohol. I did not account for dose nor for the type of risk, which is obvious because high doses of caffeine were more associated with risk than alcohol.
There was not enough data at Erowid to satisfactorily assess the outcomes associated with MDPV, a principal component of many "bath salts". The R code for this study is provided in case you would like to check for yourself with another threshold.
The user experiences at Erowid indicate that the most rewarding and least risky drugs tend to be hallucinogens and entheogens.
The user experiences at Erowid indicate that the most dangerous drug is GHB, followed by Methamphetamine, Heroin and Cocaine.
I do not partake in, endorse or recommend the use of any illegal substance. There is no intrinsically safe substance.

As a bonus, I added some code to the R script to produce word cloud data for use with programs such as the freeware program Wordaizer. To produce these word clouds, I first counted all of the words in a random sample of Erowid experiences, then counted the words in the experiences of just the drug of interest. By dividing the frequency of each word in the drug-specific set by the frequency of each word in the control sample, I was able to get an idea of which keywords were truly more common and unique to each drug as opposed to Erowid experiences in general. I then used Wordaizer under Wine in order to produce these attractive word clouds. The code I used to produce these is freely available.
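The normalization step can be sketched in base R as follows. The two tiny text samples are invented, and the add-one smoothing is an assumption made here so that words absent from the control sample do not cause a division by zero; how the original script handles that case is not described in this post.

```r
# Count lower-cased words in a set of texts.
count_words <- function(texts) {
  words <- unlist(strsplit(tolower(texts), "[^a-z']+"))
  table(words[nchar(words) > 0])
}

# Hypothetical stand-ins for the scraped experience reports.
control <- c("I felt calm and then I slept", "I felt sick and then I left")
drugSet <- c("I felt awake and jittery", "jittery and awake all night")

controlCounts <- count_words(control)
drugCounts <- count_words(drugSet)
vocab <- union(names(controlCounts), names(drugCounts))

# Relative frequency over a shared vocabulary, with add-one smoothing (assumed).
smoothed_freq <- function(counts, vocab) {
  full <- setNames(rep(1, length(vocab)), vocab)
  full[names(counts)] <- full[names(counts)] + as.numeric(counts)
  full / sum(full)
}

# Words with the highest ratio are the most specific to the drug of interest.
ratio <- smoothed_freq(drugCounts, vocab) / smoothed_freq(controlCounts, vocab)
ratio <- sort(ratio, decreasing = TRUE)
print(round(head(ratio, 5), 2))
```

Common words such as "and" end up with a ratio near 1, while words that appear mostly in the drug-specific reports rise to the top, which is what makes the resulting word clouds distinctive.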

Ayahuasca

Caffeine

Cannabis

GHB

Inhalants

LSD

Cocaine

Heroin

Meth

Figure 2: Word clouds produced using Wordaizer with Erowid experiences for a selection of drugs, normalized and prepared with R code available above.

Thanks to Simiao for proofreading and advice regarding the design of this study.