Ben Li-Sauerwine's Notebook: Data, Mine.

I've been spending a lot of time at work preparing for a competition that we will unveil at Sage Congress 2012, and I've learned that data mining in the "number of observations << number of parameters" limit is almost a different science than that in the opposite limit. We've all learned about overfitting in our high school statistics classes: the best fit to the data is not the curve that connects the dots, but instead the simplest curve that nonetheless explains most of the variability in our data.
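To make the connect-the-dots point concrete, here is a minimal Python sketch. Everything in it is an assumption chosen for illustration rather than anything from the competition data: the "true" relationship is taken to be y = 2x + 1, the noise is a fixed ±0.5 wiggle, and the helper names `lagrange` and `fit_line` are hypothetical. It compares the polynomial that passes through every point with an ordinary least-squares line, then asks both to predict one new point.

```python
def lagrange(xs, ys, x):
    """Evaluate at x the unique polynomial that passes through every (xs, ys) point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def truth(x):
    """Assumed underlying relationship (illustrative only)."""
    return 2.0 * x + 1.0


# Eight observations of the line, each nudged up or down by 0.5.
xs = [float(i) for i in range(8)]
ys = [truth(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

slope, intercept = fit_line(xs, ys)

# The connect-the-dots curve reproduces the training data essentially perfectly.
train_error_dots = max(abs(lagrange(xs, ys, x) - y) for x, y in zip(xs, ys))

# Now predict one step beyond the data with both models.
x_new = 8.0
dots_error = abs(lagrange(xs, ys, x_new) - truth(x_new))
line_error = abs(slope * x_new + intercept - truth(x_new))

print("connect-the-dots error on training data:", train_error_dots)
print("connect-the-dots error at x = 8:", dots_error)
print("straight-line error at x = 8:", line_error)
```

The curve through every dot has no error on the data it was built from, yet it misses the new point by more than 100, while the two-parameter line misses by well under 1.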

What that meant to me was that I should choose a curve or model with few, or relatively few, tuning parameters. It turns out there are plenty of other ways to overfit when you have more parameters than observations:

You can include high-information-content parameters that are nonetheless not related to what you want to predict.
You can include parameters that are indeed related to what you want to predict, but are not independent of one another.
You can have families of data points with grossly different behavior under the same parameters. If these families are poorly represented, then overfitting can occur even if you chose perfectly reasonable parameters.
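The first failure mode in the list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real dataset: the features and targets are independent random numbers, there are exactly as many parameters as observations (already enough for an exact fit), and the helpers `solve`, `predict`, and `mse` are hypothetical names written for this example.

```python
import random


def solve(a, b):
    """Solve the square system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def predict(features, w):
    """Linear model: one weight per parameter, no intercept."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in features]


def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
n_obs = n_params = 6


def noise_table(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Features and targets are drawn independently, so there is nothing real to learn.
train_x = noise_table(n_obs, n_params)
train_y = [rng.gauss(0, 1) for _ in range(n_obs)]
test_x = noise_table(n_obs, n_params)
test_y = [rng.gauss(0, 1) for _ in range(n_obs)]

w = solve(train_x, train_y)
train_error = mse(predict(train_x, w), train_y)
test_error = mse(predict(test_x, w), test_y)

print("training error:", train_error)
print("held-out error:", test_error)
```

The model reproduces its training targets almost exactly even though the parameters carry no information about them, and that apparent skill does not carry over to the held-out rows.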

I find myself frequently feeling jealous of those who work in the opposite limit. In the news recently, we learned that Target is able to identify pregnant women from their shopping habits with enough reliability to send them targeted ads. In another news-making incident involving shopping habits, the FBI used grocery store shopping card records to try to identify Iranian terrorists. Though this program was short-lived and the results are still unknown, the behaviors we know of are surely the tip of the data mining iceberg. Given enough data points relative to the number of variables of interest, even simple models should be able to tease out any signal that exists.

If people were as free about giving out their medical data as they are about their shopping habits and every little thing they and their friends did every day, I could certainly draw marvellous conclusions about prognosis, risk factors, and the best available treatments. So, it's odd that society has such mistrust of the scientific community while blindly trusting Mark Zuckerberg and company with a treasure trove of personal information.

I'm not going to say whether social networking is a good or a bad thing for you. On one hand, if a marketer can come to you and offer you deals on a product you want (or maybe didn't even know you wanted) based on some information you gave out, isn't that a mutually beneficial thing? On the other hand, even without sophisticated data mining techniques, I am reminded on a weekly basis that there is a lot to lose by giving away so much information. I'll add one more fact: if you think that Facebook is only as valuable as its community, then dividing its $100 billion value by its 845 million users makes the average profile worth about $118 to somebody.

To me, the risk of social networking is not worth the reward. I decided the first time I saw MySpace that these sites were content-free, and that I didn't want to be a part of them. So, for those people who keep friend-requesting me and inviting me, please understand that I don't hate you, and it's not that I don't want to be your friend. I want to be your real-world friend, not part of your marketing cohort.

Finally, and on a separate note: happy Easter! The picture you see above is from Cave Story, a game I recommend and that you can get and play for free, even on Linux. It has robots, rocket launchers, and rabbits. They're not actually Easter rabbits, unfortunately, but if you play your cards right the robots may learn to love. It's adorable!

Sunday, April 8, 2012
