Friday, April 27, 2012

The Nature of Corruption

One of my favorite talks at Sage Congress 2012 was Larry Lessig's talk, "Ingredients for Innovation."  If you have any interest in copyright, corruption in politics, or the 99% movement, you should really spend an hour and listen to it.  It's brilliantly delivered and easy to follow, so you don't need a PhD or familiarity with the field to understand it!

Ingredients for Innovation

Incidentally, while I was at Sage Congress I met a gentleman who carried around a copy of the US Constitution and asked his heroes to sign their favorite clause.  Lessig's?

"No Title of Nobility shall be granted by the United States: And no Person holding any Office of Profit or Trust under them, shall, without the Consent of the Congress, accept of any present, Emolument, Office, or Title, of any kind whatever, from any King, Prince, or foreign State."

Lessig put forth the idea that we now have a de facto nobility in the US, and after listening to his talk, I'm inclined to agree.

Tuesday, April 24, 2012

What Science Is

After much discussion about crowdsourcing and democratization of science during the Sage Congress, I feel compelled to commit to writing a revelation--or more accurately a recovered memory--about what science is.

When you are a student studying for exams or preparing your thesis, it's easy to believe that science is about mathematical prowess, or programming ability, or possessing some talent like Newton's or Gauss's or Bernoulli's in order to compete.  If you're a grad student, an undergrad, or a high school student considering a degree in science, you must always keep in mind that this is not the case.  These are important skills to have, but they are not what science is, any more than a powerful engine is what a car is.

Science is first about asking the right questions and second about managing precious resources:  your lab, your time, your grant money.  This is odd because these are generally not the things you are graded on, and most scientists have unfortunately never had a class in business.  If you want to be a good scientist, when you're sitting in your class, you should focus more on asking the best question about the material (even if only to yourself) and less on how the material will help you finish your next homework assignment.

So, in that way, our curricula that are supposed to be selecting for the best scientists are actually doing very little to cultivate or test for the most important skills of a scientist, which leads to a society with vast misconceptions about what science is.

I recovered this memory after some discussions about crowdsourcing.  In crowdsourcing projects such as FoldIt, EteRNA, the Netflix Challenge, TopCoder Marathon Matches, or Mechanical Turk, a problem is presented to the internet at large in the hope that someone, or some group, can solve something that is otherwise difficult for the researchers.  Surprisingly, in crowdsourced projects like FoldIt, EteRNA, or TopCoder, while there are tens of thousands of players, only thirty or so provide truly useful results.  Put another way, while boasting an impressive draw, these crowdsourcing projects are not really about getting work from the crowd at large but about finding the very best minds to work on your problem.  Those minds are far better at solving the problem than even the researcher was--but they are not necessarily better scientists!  The scientist's job was to pose the question in the first place, and in this regard the crowd is really more of a special-purpose computer.  Nonetheless, our schools are teaching would-be scientists how to be better at solving problems instead of teaching us how to pose better questions, which is probably an artifact of an age before Google, before instant globalization through the internet, and before the personal computer.

Some crowdsourcing problems like FoldIt, EteRNA and TopCoder do indeed seek only to find the very best individuals--and as such the reward structure of TopCoder at least makes sense.  When you want just one best solution, it is absolutely logical to provide prizes and recognition to only the best competitors.

Figure 1:  Reward Structure in TopCoder Marathon Matches.
Oligarchical utility, oligarchical rewards.

Further, the reward structure of Mechanical Turk also makes sense.  When the work is something that is intrinsically valuable, it makes sense to reward everyone just the same for any work they do.

Figure 2:  Reward Structure in Mechanical Turk.
Democratic utility, democratic rewards.

Finally, I ask myself whether science is more like TopCoder, where only the top performers have much utility, or like Mechanical Turk, where the work done has intrinsic value as long as it is done to standards.  In fact, most intelligent questions, when asked and tested, will not yield Earth-shatteringly important results.  That's what science is, though:  the systematic testing of hypotheses.  Most will fail, but knowledge of what doesn't work is still knowledge, and that's what's important.  For example, consider kinases as chemotherapy targets.

There are hundreds of different kinases known to exist in the human body.  Most of them are probably not ideal chemotherapy targets, but knowing which ones are could save thousands of lives.  So, we study each one equally to try to find which one to target, right?  

Wrong.

In fact, just a few dozen kinases are the focus of the vast majority of research, with interest following something like an exponentially decreasing curve (I will cite the appropriate Sage Congress Unplugged talk when it's posted).  Why?  Because some rock-star scientist at some point thought a particular kinase was important, then the good funding money went with him, and then all of the bread-and-butter scientists followed.  So, was the rock-star scientist someone deeply brilliant and insightful, like Einstein or Heisenberg?

Absolutely!  ...but brilliance was only one factor in their success.  There are many others like them whose only fault was to be unlucky.

In fact, those wishing to pursue new and unique avenues of research are frequently ignored because they don't fit in with what everyone else is doing.  Stanley Prusiner of prion fame struggled to get any funding or resources to study his Nobel prize-winning idea.  Even Newton or Gauss would likely have struggled to get a job in today's academic climate.  The truth is that for every Stanley Prusiner, there are hundreds more like him with potentially revolutionary ideas that don't end up working, and who are forgotten by history.  Scientific progress is nearly always based on a shot in the dark, and careers are made and broken by what amounts to a stroke of luck.  Good science, whether the study turns out to be successful or not, is intrinsically valuable for what it is.  However, the lion's share of the rewards goes to the few who happened upon the rare theory that turns out to work.  This leads most scientists to focus on whatever the lucky few happened upon instead of on something potentially revolutionary, because as long as you're somewhere nearby you can scrounge a few scraps of grant money for reflecting someone else's success.

Figure 3: Reward Structure in the Scientific Establishment.
Democratic utility, oligarchical rewards.

This is a huge problem for the democratization of science.  The reward structure looks like one where only the very best solution matters, but the utility is in fact proportional to the amount of good science done.  This discourages anyone from straying far from the current hot topics and vastly reduces the total amount of work done.  Mechanical Turk would be out of business if only the person who did the most work got any reward!

This is why new cancer therapies are so slow in coming.  It's dangerous to pursue an approach that nobody has tried before, because if you fail you might not get your next grant.  You might be passed over for tenure.  Your paper might not make it into a journal with a high impact factor.  If you aren't the top dog in the field, you have to at least closely follow what the best people are doing so that you can claim that you're nearly as good.

Despite throwing billions of dollars at important problems, we are not making efficient progress, because the reward structure does not reflect the utility of good science.  This sigmoidal relationship between reward and work has not only concentrated the bulk of progress in a very small number of people, but has also hugely reduced the overall amount of work done and the number of radical, innovative hypotheses tested.

... all because those in charge of funding have largely forgotten what science is.

Wednesday, April 18, 2012

Sage Congress 2012

I've been working hard to have my contribution to Sage Congress 2012 ready; it's going to come down to the wire, but it will definitely be there to show you!

If you haven't heard about it, we're bringing together over 200 of the biggest and rising stars in open science and open research for a fast-paced two-day brainstorming session.  This is going to be incredible--and the best part is that in the true spirit of openness, you're welcome to participate!  I strongly encourage you to check out our agenda, watch the free live webcast, download it from our website or iTunes if you miss it, comment using the Twitter hashtag #SageCon, and tweet questions to @SageBio!

Sunday, April 8, 2012

Data, Mine.

I've been spending a lot of time at work preparing for a competition that we will unveil at Sage Congress 2012, and I've learned that data mining in the "number of observations << number of parameters" limit is almost a different science from data mining in the opposite limit.  We've all learned about overfitting in our high school statistics classes:  the best fit to the data is not the curve that connects the dots, but instead the simplest curve that nonetheless explains most of the variability in our data.

What that meant to me was that I should choose a curve or model with few, or at least relatively few, tuning parameters.  It turns out there are plenty of other ways to overfit when you have more parameters than observations (a short simulation after this list makes the danger concrete):
  • You can include high information content parameters that are nonetheless not related to what you want to predict.
  • You can include parameters that are indeed related to what you want to predict, but are not independent of one another.
  • You can have families of data points with grossly different behavior under the same parameters.  If these families are poorly represented, then overfitting can occur even if you chose perfectly reasonable parameters.
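To make the danger concrete, here is a minimal simulated sketch (a toy example, not the competition data):  with 20 observations and 100 predictors that are pure noise, simply screening for the predictors most correlated with the response yields a model that looks great in-sample and is worthless on fresh data.

set.seed(1)
n <- 20    # observations
p <- 100   # candidate predictors, all pure noise

train <- data.frame(matrix(rnorm(n * p), nrow = n))
y     <- rnorm(n)   # the response is unrelated to every predictor

# Screen for the 10 predictors most correlated with y,
# then fit an ordinary linear model to just those
best <- order(abs(cor(train, y)), decreasing = TRUE)[1:10]
fit  <- lm(y ~ ., data = cbind(y = y, train[best]))
summary(fit)$r.squared    # impressively high in-sample R^2

# The same model has essentially no predictive power on new data
test  <- data.frame(matrix(rnorm(n * p), nrow = n))
y.new <- rnorm(n)
cor(predict(fit, newdata = test[best]), y.new)^2   # typically near zero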
I find myself frequently feeling jealous of those who work in the opposite limit.  In the news recently, we learned that Target is able to identify pregnant women from their shopping habits reliably enough to send them targeted ads.  In another news-making incident involving shopping habits, the FBI used grocery store shopping-card records to try to identify Iranian terrorists.  Though that program was short-lived and its results are still unknown, the programs we know about are surely the tip of the data-mining iceberg.  Given enough data points relative to the number of variables of interest, even simple models should be able to divine whatever signal exists.

If people were as free about giving out their medical data as they are about their shopping habits and every little thing they and their friends did every day, I could certainly draw marvellous conclusions about prognosis, risk factors, and the best available treatments.  So, it's odd that society has such mistrust of the scientific community while blindly trusting Mark Zuckerberg and company with a treasure trove of personal information.  

I'm not going to say whether social networking is a good or a bad thing for you.  On one hand, if a marketer can come to you and offer deals on a product you want (or maybe didn't even know you wanted) based on some information you gave out, isn't that mutually beneficial?  On the other hand, even without sophisticated data mining techniques, I am reminded on a weekly basis that there is a lot to lose by giving away so much information.  I'll add one more fact:  if you think that Facebook is only as valuable as its community, then dividing its $100 billion valuation by its 845 million users values the average profile at about $118 to somebody.

To me, the risk of social networking is not worth the reward.  I decided the first time I saw MySpace that these sites were content-free, and that I didn't want to be a part of them.  So, for those of you who keep friend-requesting and inviting me, please understand that it's not that I hate you or don't want to be your friend.  I want to be your real-world friend, not part of your marketing cohort.

Finally, and on a separate note:  happy Easter!  The picture you see above is from Cave Story, a game I recommend and that you can get and play for free, even on Linux.  It has robots, rocket launchers, and rabbits.  They're not actually Easter rabbits, unfortunately, but if you play your cards right the robots may learn to love.  It's adorable!


Sunday, April 1, 2012

Should the Crowd Pick my Next Video Card?

I helped a friend design a new computer the other day, and part of this process involved doing some research to find the video card that provided the best value for the money.

Conveniently, this website provided benchmark scores and prices for a wide range of modern video cards, and I ultimately found and recommended the one on NewEgg with the highest benchmark score per price.

I was left wondering, though:  could I have saved time by simply selecting the most popular video card or the highest rated one?  Should I let the crowd buy my next video card?  In this example, I will use the RCurl and foreach packages to download and parse data from NewEgg and compare it to the benchmarks from the website above to determine whether or not the crowd makes the best video card decision.

I'm going to state up-front a few assumptions relevant to this study:

  • I compare only the most popular listing of each video card variety.  Since this is a question of value versus popularity, and I am comparing just one listing per variety, the most popular one is the natural example to choose!
  • The benchmark score is the single point of comparison to determine which video card is best.  In reality, there are a lot of great reasons to choose a piece of equipment outside of just performance! 
  • I retrieved my data on March 27, 2012.  The direct analysis should not be used to pick a video card far from this date, as the best value may change! 
  • NewEgg's price, ratings, and number of reviews are taken as the market price and the confidence of an expert panel in the product.  
  • For my purposes, value equals performance divided by price. 

I like this example, because part of this task involves drawing data from the web and parsing it.  The following source code retrieved my price data from NewEgg:

library(RCurl)
library(foreach)

# Cut and pasted from videocardbenchmarks website
NameVsPerformance <- read.table("~/Desktop/performance.dat")

# .combine = c returns a plain vector rather than a list, so the
# as.numeric() calls further down work without extra coercion.
Prices <- foreach(i = 1:nrow(NameVsPerformance), .combine = c) %do% {

  # I manually did a search, and got these
  # parameters from the query string.
  result <- getForm("http://www.newegg.com/Product/ProductList.aspx",
                    .params = c(Submit = "ENE",
                                N = "100007709",
                                DEPA = "0",
                                IsNodeId = "1",
                                Description = gsub("_", " ",
                                                   NameVsPerformance[i, 1]),
                                bop = "And",
                                Order = "REVIEWS",
                                PageSize = "1"))

  # Some contingencies to make sure
  # that my search was successful
  match <- regexpr("We have found 0 active items that match", result)
  matchstr <- regmatches(result, match)

  if (length(matchstr) > 0) {
    return(0)
  }

  match <- regexpr("Reduce the number of keywords used", result)
  matchstr <- regmatches(result, match)

  if (length(matchstr) > 0) {
    return(0)
  }

  match <- regexpr("<strong>0 items</strong>", result)
  matchstr <- regmatches(result, match)

  if (length(matchstr) > 0) {
    return(0)
  }

  # Regular expression out the price
  match <- regexpr("<strong>[[:digit:]]+</strong><sup>.[[:digit:]]+</sup>", result)
  matchstr <- regmatches(result, match)

  # And format it properly
  if (length(matchstr) == 0) {
    return(0) # Price is 0 if none was found.
              # I'll drop these later.
  } else {
    return(gsub("[A-Za-z/<>]", "", matchstr[1]))
  }
}


It's a pretty trivial change to grab the Rating (1 to 5 eggs at NewEgg) and the Popularity (the number of reviews, taken as a proxy for popularity).  I just change the regular expression; a sketch of the full cleanup follows the snippets below.

  # For Rating
  match <- regexpr("eggs r[[:digit:]]", result)
  matchstr <- regmatches(result, match)

  # For Number of Reviews (proxy for Popularity)
  match <- regexpr("[(][[:digit:]]+[)]</a>", result)
  matchstr <- regmatches(result, match)
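
For completeness, a sketch along these lines (mirroring the Prices loop, and assuming the same page markup, e.g. "eggs r4" and "(123)</a>") will turn those matches into the Ratings and Buyers vectors used below:

# A small helper: keep just the digits from a regex match, or 0 if nothing matched
numFromMatch <- function(m) {
  if (length(m) > 0) as.numeric(gsub("[^[:digit:]]", "", m[1])) else 0
}

# Collect rating and review count per card, mirroring the Prices loop above
RatingsAndBuyers <- foreach(i = 1:nrow(NameVsPerformance), .combine = rbind) %do% {

  result <- getForm("http://www.newegg.com/Product/ProductList.aspx",
                    .params = c(Submit = "ENE", N = "100007709", DEPA = "0",
                                IsNodeId = "1",
                                Description = gsub("_", " ", NameVsPerformance[i, 1]),
                                bop = "And", Order = "REVIEWS", PageSize = "1"))

  # Rating: "eggs r4" -> 4 (0 if the card has no reviews)
  rating  <- numFromMatch(regmatches(result, regexpr("eggs r[[:digit:]]", result)))

  # Popularity: "(123)</a>" -> 123 reviews
  reviews <- numFromMatch(regmatches(result, regexpr("[(][[:digit:]]+[)]</a>", result)))

  c(rating, reviews)
}

Ratings <- RatingsAndBuyers[, 1]
Buyers  <- RatingsAndBuyers[, 2]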

And now I'm ready to make some plots!  First, let's take a look at price versus performance.

# Collect the card data.
cardData <- cbind(NameVsPerformance,
                  as.numeric(Prices),
                  as.numeric(Ratings),
                  as.numeric(Buyers),
                  as.numeric(Prices) / NameVsPerformance[, 2])
colnames(cardData) <- c("Name",
                        "Performance",
                        "Price",
                        "Rating",
                        "Popularity",
                        "Value")

# Filter out the not-found cards (price of 0)
cardData <- cardData[cardData[, "Price"] > 0, ]

# Flip price-per-point into performance-per-dollar,
# so that a bigger Value means a better deal
cardData[, "Value"] <- 1 / cardData[, "Value"]


# Look at price vs performance
plot(cardData[,c("Performance", "Price")])



Plot 1:  Price vs Performance


In plot 1, what we see is not surprising: an exponential (or at least nonlinear) increase in price as you go to higher and higher performance.  When I fit it to an exponential curve, the fit suggested that price roughly doubles for every 1250 performance points.  If this plot were linear, it really wouldn't matter which video card we picked.  There are good reasons to believe it should not be linear:  at the bleeding edge of technology, the dual effect of customer enthusiasm and limited supply makes cards very expensive.  At the low end, competition is driven by a limited number of customers who are looking for a component compatible with their older hardware, or who care more about a really low absolute price than about value or performance.  When I buy a computer, I want to make sure I get the most for my money.  Let's take a look instead at performance versus value.
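
As an aside, the exponential fit and the next plot are short calls against the cardData frame built above; this is a sketch of one way to get the doubling figure (the exact number will of course depend on the data you pull):

# Fit log(Price) ~ Performance, then convert the slope into the number of
# performance points over which price doubles
expFit <- lm(log(Price) ~ Performance, data = cardData)
log(2) / coef(expFit)["Performance"]

# Plot 2: value (performance per dollar) versus performance
plot(cardData[, c("Performance", "Value")])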


Plot 2:  Value vs Performance


Plot 2 is again not surprising.  We see that older, lower-performance video cards cost more per performance point, because they may no longer be in production and customers are deciding based on compatibility or absolute price.  Meanwhile, the newest technology is also expensive because everyone wants the cool new item on the market but supply is limited.  In the middle, we see a lot of great deals to be had on the last iteration of technology, where performance is great but prices have been dropped to make way for the new line.  This is a common phenomenon, visible in everything from cars to video cards--everyone knows that a good time to buy is when the store is trying to make room for new stock!  Incidentally, the extreme value you see is a refurbished GeForce GTX 465, which was out of stock but had been priced to move at $109.  Let it be known that those savvy buyers got a great deal!  What I really wanted to know, however, was whether I could trust other consumers to make the best possible choice in value.  Let's look at customer ratings and popularity versus value.
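
The remaining plots come from the same kind of call as Plot 1; something along these lines:

# Plot 3: value versus customer rating
plot(cardData[, c("Rating", "Value")])

# Plot 4: value versus popularity (number of reviews)
plot(cardData[, c("Popularity", "Value")])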

Plot 3:  Value vs Rating


From plot 3 it does appear that NewEgg's customer ratings correlate with value, but with considerable scatter (cards with a zero rating had no reviews at all).  There are no one- or two-egg rated products here because I chose the most popular example of each video card type, and nobody wants to buy a buggy graphics card!  Indeed, benchmark scores are not the only reason to buy a video card.  Maybe you're using an older computer that can't support the newest ones, or you don't care about video performance and just want a low price.  So, simply sorting by "best rating" may not be the best way to choose a video card because you don't know what the reviewers' expectations were in the first place.  Let's look at one more plot, value versus popularity, before making some closing remarks.

Plot 4:  Value vs Popularity (Number of Reviews)

Again, the results shown in plot 4 are all over the board.  Interestingly, the most popular video card (which turns out to be a Radeon HD 6850) is actually a really good value!  In fact, it's probably a better value than shown, because at the time I searched for this card a rebate brought it from $140 down to $125.  This plot shows another interesting phenomenon:  the fact that most video cards have very few reviews suggests that a lot of people are swayed to purchase whatever the "most popular" sort puts in front of them.  For instance, the buyers of the second and third most popular video cards did not get a very good value for their money!  Some video cards are runaway best sellers, while other excellent values are left in the dust.  So while the most popular card happens to be a great value for the money, similar or better values exist that a customer searching by popularity alone might never see.

From this simple example, I've learned a few things about buying technology online:
  • Do your own research!  You don't know which factors mattered most to the reviewers when they bought their hardware.
  • Very old technology and very new technology tend to be a worse value than last-generation bargains.
  • The "best rating" and "most popular" sorts at websites sway a lot of people into buying a product.  Don't jump off the proverbial cliff just because everyone else is, but looking at the outliers in popularity is a good place to start!
If I were an online retailer, I might add this:
  • Consider giving value-conscious consumers metrics other than what's popular to sort by: for instance, benchmark score per price.
And if I were an R programmer interested in some new ways to use data:
  • RCurl and regular expressions taken together provide a powerful tool for drawing data from the web.
Finally, to answer my question:  Should the crowd pick my next video card?  Maybe not, but what's popular is definitely a good place to start looking!