Thursday, January 15, 2015

My Review of Workshop on Algorithmic Challenges in Machine Learning 2015

Last week I had the privilege to attend the Workshop on Algorithmic Challenges in Machine Learning (2015) at UCSD. Kamalika and Shachar put together a wonderful set of speakers (PDF) who spoke on a wide range of topics including property testing, nearest neighbor, and sparse coding (and they provided free lunch all three days for a free workshop, which was a nice surprise). I won't go over all of the talks, just a couple that I enjoyed a lot.

Beyond Locality Sensitive Hashing, Piotr Indyk
Piotr talked about their recent results in nearest neighbor algorithms. He started with a nice review of the locality sensitive hashing literature, and apparently ball lattice hashing was the best up until now.

Their (Andoni, Indyk, Nguyen, and Razenshteyn) technique is to preprocess the data to remove "dense" clusters, and then to use a basic Hamming LSH on the rest. Specialized data structures are then used for the dense clusters. The clusters are translated closer to the origin to make them each sparse, and then minwise hashing is applied to each. Upon processing a query point, you see if it is in one of the dense clusters first, and then apply the appropriate subprocedure to return the answer.

He says that this is a small difference, but it allows for enough of an improvement to be significantly better. Furthermore, he talked about how the result by Andoni and Razenshteyn this year improves on this further by doing data-dependent hashing. I thought this was a nice, neat trick to attack the problem.

The Reusable Holdout: Preserving Validity in Adaptive Data, Moritz Hardt
This was probably my favorite talk out of the whole workshop. Moritz brought up Kaggle (a site where data owners can provide rewards for the best model on a dataset, and people compete for the prize by submitting models). The problem that competitors face on Kaggle is that if they're not aware of the problem, then they can overfit their model to the holdout set (a subset of the training set, also called the validation set if you're more used to that term), and do poorly on the test set.

He pointed out that this also happens when people do science ("p-hacking" is a term often used for this), when scientists repeatedly test against the same holdout set. Some proposals to combat the problem involve publishing the entire experimental plan ahead of time, but this isn't always desirable because often you want to explore different methods of obtaining results. Science is an adaptive process.

Moritz summed up overfitting with the following: overfitting is what you get when the expected loss computed over the holdout is less than the expected loss computed over the distribution. The claim is that there's a holdout that can be reused without overfitting. The key? Differential privacy. This is one of those ideas that makes you think "oh, of course" because it so elegantly addresses the problem.  (I also like this because it's one of those ideas that builds on differential privacy that has nothing to do with actual privacy.)

We need that the answer is close to the expectation, not the empirical average, or as Moritz put it, "differential privacy and accuracy on the sample implies accuracy on the distribution." He went into some specifics and an example which I won't get into here, but you should check out the paper.

Other talks
While I'm not going to go into much detail on the other talks, I'll say that the quality was pretty high. I could briefly describe them here but I'll let you browse the abstracts that I linked above. It was definitely worth the trip for a few days.