Wednesday, January 21, 2015

Reproducing Corn Kernel Space (RCKS)

The stuff on the left is from an old labmate who was visiting. I thought it needed a lifting map and a technical name. Those are baby corns on the right. You know, because it's a reproducing corn kernel space.
Old labmates can be very helpful. I'm not sure how appetizing green corn is though.

Thursday, January 15, 2015

My Review of Workshop on Algorithmic Challenges in Machine Learning 2015

Last week I had the privilege of attending the Workshop on Algorithmic Challenges in Machine Learning (2015) at UCSD. Kamalika and Shachar put together a wonderful set of speakers (PDF) who spoke on a wide range of topics, including property testing, nearest neighbor, and sparse coding (and they provided free lunch all three days at a free workshop, which was a nice surprise). I won't go over all of the talks, just a couple that I enjoyed a lot.

Beyond Locality Sensitive Hashing, Piotr Indyk
Piotr talked about their recent results on nearest neighbor algorithms. He started with a nice review of the locality sensitive hashing literature; apparently ball lattice hashing had been the best approach up until now.

Their technique (due to Andoni, Indyk, Nguyen, and Razenshteyn) is to preprocess the data to pull out "dense" clusters and then use a basic Hamming LSH on the rest. Specialized data structures handle the dense clusters: each cluster is translated closer to the origin to make it sparse, and minwise hashing is applied to it. When processing a query point, you first check whether it falls in one of the dense clusters, and then apply the appropriate subprocedure to return the answer.

He said that this is a small change, but it yields a significant improvement. He also talked about how this year's result by Andoni and Razenshteyn improves on it further with data-dependent hashing. I thought this was a nice, neat trick for attacking the problem.
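To make the routing idea concrete, here's a toy Python sketch of the query path as I understood it from the talk: dense clusters are checked first (by distance to their stored centers), and everything else falls through to a single bit-sampling Hamming LSH table. The class and helper names are my own inventions for illustration, and a brute-force scan stands in for the specialized per-cluster structures (the recentering and minwise hashing step above), so don't mistake this for the authors' actual data structure.

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(x, y))

class ToyTwoLevelNN:
    """Toy two-level nearest-neighbor index (illustrative only):
    dense clusters are stored separately and checked first; the
    remaining "sparse" points go into one bit-sampling Hamming LSH table."""

    def __init__(self, points, clusters, radius, k=8, seed=0):
        # points: binary tuples; clusters: list of (center, member_points)
        # produced by some preprocessing step that found the dense regions.
        self.clusters = clusters
        self.radius = radius
        rng = random.Random(seed)
        dim = len(points[0]) if points else 0
        self.coords = rng.sample(range(dim), min(k, dim)) if dim else []
        clustered = {p for _, members in clusters for p in members}
        self.table = {}
        for p in points:
            if p not in clustered:  # only the sparse remainder is hashed
                self.table.setdefault(self._key(p), []).append(p)

    def _key(self, p):
        # Bit-sampling hash: project onto k random coordinates.
        return tuple(p[i] for i in self.coords)

    def query(self, q):
        # 1) If q lands inside a dense cluster, search only that cluster
        #    (brute force here, standing in for the specialized structure).
        for center, members in self.clusters:
            if hamming(q, center) <= self.radius:
                return min(members, key=lambda p: hamming(q, p))
        # 2) Otherwise fall back to q's LSH bucket (which may be empty).
        bucket = self.table.get(self._key(q), [])
        return min(bucket, key=lambda p: hamming(q, p)) if bucket else None
```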

The Reusable Holdout: Preserving Validity in Adaptive Data, Moritz Hardt
This was probably my favorite talk out of the whole workshop. Moritz brought up Kaggle (a site where data owners can provide rewards for the best model on a dataset, and people compete for the prize by submitting models). The danger competitors face on Kaggle is that, if they aren't careful, they can overfit their model to the holdout set (a subset of the training set, also called the validation set if you're more used to that term) and then do poorly on the test set.

He pointed out that this also happens when people do science ("p-hacking" is a term often used for this), when scientists repeatedly test against the same holdout set. Some proposals to combat the problem involve publishing the entire experimental plan ahead of time, but this isn't always desirable because often you want to explore different methods of obtaining results. Science is an adaptive process.

Moritz summed up overfitting with the following: overfitting is what you get when the empirical loss on the holdout is less than the expected loss over the underlying distribution. The claim is that there's a holdout that can be reused without overfitting. The key? Differential privacy. This is one of those ideas that makes you think "oh, of course" because it so elegantly addresses the problem. (I also like it because it's one of those ideas that builds on differential privacy yet has nothing to do with actual privacy.)

We need answers that are close to the true expectation, not just the empirical average; or, as Moritz put it, "differential privacy and accuracy on the sample implies accuracy on the distribution." He went into some specifics and an example that I won't cover here, but you should check out the paper.
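The paper's mechanism (Thresholdout) is concrete enough that a sketch helps convey the flavor. The following is my simplified reading of it, from memory: answer a query from the training set as long as the training and holdout estimates agree up to a noisy threshold, and only touch the holdout (with added noise) when they disagree. The parameter names and Laplace noise scales are illustrative choices, not the constants from the paper, and I've dropped the budget bookkeeping that the real algorithm does.

```python
import numpy as np

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """Simplified Thresholdout-style answer to a single query.

    train_vals / holdout_vals: per-example values of the statistic phi
    (e.g. 0/1 losses of the current model) on the training and holdout
    sets. Returns an estimate of E[phi] that only draws on the holdout
    when the training estimate stops tracking it."""
    rng = rng or np.random.default_rng()
    train_avg = np.mean(train_vals)
    holdout_avg = np.mean(holdout_vals)
    # Noisy threshold test: does the training estimate still track the holdout?
    if abs(train_avg - holdout_avg) > threshold + rng.laplace(0, sigma):
        # Disagreement: answer from the holdout, with noise to limit leakage.
        return holdout_avg + rng.laplace(0, sigma)
    # Agreement: the training estimate is a good proxy, so no holdout
    # information leaks at all for this query.
    return train_avg
```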

Other talks
While I'm not going to go into much detail on the other talks, I'll say that the quality was pretty high. I could briefly describe them here but I'll let you browse the abstracts that I linked above. It was definitely worth the trip for a few days.

Wednesday, March 12, 2014

Status Update

Time to break my silence. Well, time to get around to posting something on my blog. There are several things that have happened since the last time I posted, in no particular order.

We got a paper into AISTATS 2014. Our paper, "A Geometric Algorithm for Scalable Multiple Kernel Learning," which I wrote with Suresh Venkatasubramanian, Avishek Saha, and Parasaran Raman, will have a poster at AISTATS in Reykjavik. This is my first big conference paper where I'm "first author" (and also literally the first author, since my last name comes first in the alphabet). I'll post an arXiv version very soon and link it here.

My student coauthors are no longer students. Avishek graduated some time ago and Parasaran graduated in December. Both are now at Yahoo. I'm now the most senior PhD student in the lab and I'm feeling the pressure.


I did an internship at Google last summer. I may have talked about my age before, but it's no big secret -- I had my 37th birthday during my internship. Since the movie The Internship came out last summer, I was of course on the receiving end of the jokes, seeing as I'm not that far behind Vince Vaughn chronologically. I know I'm not really "supposed" to be doing an internship "at my age," but I don't really care, because I feel like I made the right decision to reboot my career.
 
I also did an internship in 2010, and my age kind of made me feel isolated then. This time it was different. It was really amusing to talk to the twentysomethings and laugh as I put things in perspective for them. I got over the fact that a large fraction of PhD students were born in the '90s. They haven't had a chance to feel old yet, but the kids born in 2000 will be college freshmen in five years. It's delightful to see their reactions.

I didn't just poke fun at the youthful; I also got some work done. My work at Google NYC was on burst detection in topic streams, and it was a lot of fun. My host and I are also orbiting around a paper for the work.

I interviewed at Google in February for a full-time position. This was my "conversion" interview related to the internship above (to "convert" from an intern to a full-timer). I believe that it went well and I've heard the same from my sources, but I have yet to hear a final decision. Fingers are crossed though.


Monday, May 13, 2013

Things to Remember about That Thing You’re Trying to Create

(My view of the creative process)
  1. The perfect is the enemy of the good.
  2. For a work to be good you have to start it.
  3. Everything has parts. To create a whole you must identify its parts.
  4. Write down the parts, even if you don't know about them.
  5. If you don't know about a part, research it. Then revise that part.
  6. Parts have parts. Go back to 4.
  7. No more parts? Are you sure? Go back to 4.
  8. Why this work? That's a part, go back to 4.
  9. Who else did work like this? That's a part, go back to 4.
  10. Does your work "work"? How do you prove it? That's a part, go back to 4.
  11. Pick the next part and start making it.
  12. If you hate the current part go back to 11.
  13. You hate all the parts? Too bad. Finish.
  14. Go back to 11 until you're done with all the parts.
  15. Ok, now you can make it good.
  16. You can't skip to 15. Go back to 11.
  17. All done? Make it awesome. Make it tell a story.
  18. You think it's awesome? No. It sucks. Show it to someone else.
  19. Did you listen to them? Good. Fix it. Go back to 18.
  20. Do you hate it yet? Yes? It's ready.

Tuesday, February 19, 2013

Research Retrospective

I've been quiet for a while on this blog simply because of work and other things. I tried to keep up with a couple online courses, like the Data Analysis course by Jeff Leek, and I've tried to keep an eye on John Langford & Yann LeCun's class, but Suresh's Computational Geometry course has kept me as busy as I'd like while working on research.

Research-wise, I've been prepping our submission for ICML Cycle III. Now that that's over, I have some time to worry about other things, like applying and interviewing for internships.

With the breathing room, I got to thinking about something I enjoy doing now that I found tedious at the beginning of my research career. So here's a little retrospective from my 4th year, looking back at my first:

1st year: No clue how to form a research question
4th year: Able to weed out most of the bad questions myself

I started out thinking of research questions just as "stuff that I thought would be neat if it were true." That isn't the wording I'd have used at the time, but since then I've developed a better sense of how to formulate a research question. Thoughts like

  • "would anyone give a crap?"
  • "someone surely thought of this before -- yep, they sure did," 
  • "OK, so someone thought of this, did they think about this aspect?"
  • "so they didn't think of that, how trivial is it for me to test/find out?"
are all thoughts that I wasn't able to form myself a few years ago. I've still got a ways to go on this front, but at least I can now weed out my own dumb questions.

1st year: Tracking down a body of work is tedious but necessary
4th year: Tracking down a body of work is rewarding and enjoyable

By "tracking down a body of work" I mean taking the research question that you've vetted and finding all the prior relevant work, digging through citations, doing web searches, etc. I was quite surprised to realize yesterday that I enjoy it. When I first really started research I would avoid papers in favor of presentations (I know, I know). Then I started favoring the papers (reluctantly) and would plod through them.

Perhaps because I now know how to read papers, I can see the tendrils going through the papers and see how the authors, the conferences/pubs, and the work link together in time and space, like those slow-motion videos of a lightning bolt feeling its way to the ground. It sounds like I read too much sci-fi, but if you get it, you know what I mean. The endeavor is much more exciting and rewarding now. A researcher could only come to that point with experience.

1st year: Analyzing a body of work is rewarding and enjoyable
4th year: Analyzing a body of work is rewarding and enjoyable -- but I do it faster now

Analysis, and by that I mean poring over the work and figuring out what's going on, has been my strong suit. I was never a "hacker" in the sense of throwing code together and seeing if it'll work. I was always the kind of coder that figured out what needed to be done, nearly completely, and coded it up.

I think this mostly came from my background of backporting software and writing cross-language code. Both of those activities require you to know every line of code that you work with and what those lines are supposed to do. That, and probably my bachelor's in math (which also helps with theory CS and ML), equipped me to really understand what I was reading, given enough time. And I always loved it. I still do.

The difference now, as I mentioned before, is that I know how to read papers better, so the process happens a lot faster. It takes many more papers to fatigue me, which I take as a sign that I understand the underlying context better.

1st year: Didn't know the process
4th year: Know the process

The rest of research is pretty much just practice (although I have a nagging feeling that I missed something). Any student can work on a problem; it's just a matter of how much padding you need. It's pretty clear in retrospect that a great deal of what your advisor does is check the padding and take it off when you're ready. Writing, submitting, checking, revising, communicating, speaking, collaborating -- all that good stuff comes with practice.

There are topics I'm omitting on purpose, like grant writing, because I doubt I'll ever get a lot of exposure to them. But feel free to talk about your experiences in the comments.

Monday, January 7, 2013

Large Scale Machine Learning Class

Yann LeCun and John Langford are teaching a Large Scale Machine Learning class this semester at NYU:

Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous.

John says that this isn't a MOOC, but that the notes and lectures are going to be made available online:

We plan to videotape lectures and put them (as well as slides) online, but this is not a MOOC in the sense of online grading and class certificates. I’d prefer that it was, but there are two obstacles: NYU is still figuring out what to do as a University here, and this is not a class that has ever been taught before. Turning previous tutorials and class fragments into coherent subject matter for the 50 students we can support at NYU will be pretty challenging as is. My preference, however, is to enable external participation where it’s easily possible. 
Suggestions or thoughts on the class are welcome :-)

Saturday, December 22, 2012

Data Courses on Coursera

The desire to gain expertise in big data doesn't show any signs of going away, nor is there any slowdown in the flow of training on the topic: there are three courses on Coursera all starting at about the same time.

They all start next month, January 2nd, 7th, and 22nd respectively. The first looks like an intro course covering R programming and the basics of analysis, and lasts 4 weeks. The other two look like more in-depth analysis training and last 10 and 8 weeks.

Is anyone planning to take any of these courses? Which one?