Learning Deep Learning

Learning about Deep Learning that is.

It’s a specialized area of neural networks, which are themselves a specialized area of machine learning. Deep Learning is hot right now, with very impressive results on tasks that have been difficult for other machine learning methods to solve.

I need to learn more, and now. But how? Over the last few years the rise of MOOCs has provided a rich stream of learning possibilities across a huge range of subjects, including high-end tech such as Andrew Ng’s Coursera Machine Learning class, where basic neural networks were covered as part of the course.

Currently there is no Deep Learning MOOC. This will need to be a self-constructed curriculum. Who are the leaders in the field? The University of Toronto’s Geoffrey Hinton is on top, very closely followed by Yann LeCun of NYU and Yoshua Bengio at the University of Montreal. Last year Hinton was hired by Google and LeCun by Facebook. Both retain their academic positions and part-time teaching responsibilities.

That’s the background. The rest of this post describes my start on learning about Deep Learning. The journey is far from over. These are only my initial steps and a few thoughts about where to go next.

Hinton’s Coursera Neural Networks class was offered in October 2012. It is an advanced neural networks class that goes way beyond the simple nets covered in Ng’s course. It is not specifically a deep learning class but does spend several lectures on the topic. The course has not been offered again and I don’t think it will. Hinton must be very busy with Google and indicates on his web page that he is not accepting new students for graduate study.

I looked at some of the archived video from this class. Even after Ng’s class I did not feel like I had the background to really appreciate the material. These lectures get deep fast. I need to prepare. Also, for my particular learning style, I want the “50,000-foot view” of Deep Learning and a sense of where all the pieces are going to fit.

So instead of jumping right into the nearly 13 hours of Coursera class video, I found A Tutorial on Deep Learning that Hinton presented in 2009. The tutorial runs 3 hours and felt deeper than what I needed at that point. It may have been my unfamiliarity with Hinton’s presentation style, or maybe he was trying to compress the entirety of his experience to fit the venue. I did not finish watching this one.

Along the way I found Hugo Larochelle’s Neural Networks class videos. This is not a MOOC, just the videos from the course he teaches at Université de Sherbrooke. I like the presentation style and find the material accessible. He covers many neural network topics well beyond the basics. Still, this is over 16 hours of lectures! I am impatient to get further into deep learning concepts, so I’ve been cherry-picking these. I’ve watched most of the first two units to “calibrate” with what I learned from Ng’s class and have also gone through most of the autoencoder material.

Next I came upon a LeCun and Bengio lecture I could really get my mind around. It’s the one-hour 2009 ICML workshop Tutorial: Deep Learning Architectures. Digging further, I found LeCun’s 2013 ICML Deep Learning Tutorial very useful. This one is 1½ hours long. Now, after watching these two, I’m starting to see the shape of the field, that 50,000-foot view. This is what I need for my learning style; everyone’s style is different.

Will Stanton’s presentation at the May 2014 Data Science & Business Analytics meetup confirmed many of my perceptions, corrected some others, and exposed new dimensions to investigate.

Immediately after that I found a talk by Jeremy Howard at the Data Science Melbourne meetup on May 12th, 1¾ hours. (Jeremy was enrolled as a student in the initial offering of Bill Howe’s Coursera Introduction to Data Science. He contributed to many interesting forum threads there. He is currently involved in a startup creating deep learning tools. His last position was as president and chief scientist at Kaggle.) This talk is excellent. Jeremy covers the emergence of deep learning over other techniques, social and economic implications, and a bit of implementation detail. This is a must-watch, the talk that I would have liked to start with, but it did not yet exist.

Now at this point I am finally ready to absorb some Hinton presentations. His Google Tech Talks Recent Developments in Deep Learning (2010) and The Next Generation of Neural Networks (2007) were very good even if I did watch them out of chronological order. Each talk is one hour.

What next?

  • Working through the neural network course videos from both Hinton and Larochelle.
  • Trying out open source deep learning software packages.
  • Continuing to search for and watch lectures. They are not all on YouTube.
  • Google+ communities Machine Learning and Deep Learning are both very active.
  • Reddit has a very active and interesting subreddit /r/machinelearning. Yann LeCun did an AMA there last month. (There is also a /r/deeplearning subreddit but nothing is happening there.)
  • Check out LeCun’s Spring 2014 NYU Deep Learning course videos and notes.

GPU Shopping

So you want to run some deep learning code and heard you need a GPU? This seems correct. For deep learning problems of sufficient size to be interesting the alternative is to either (A) use a supercomputer or (B) have your body cryogenically preserved and arrange to be reanimated when your program completes. The second option makes iterative development cumbersome.

You could also rent time on a GPU-equipped cloud instance. But I believe this only makes sense with both already-debugged code and the appropriate economics. A recent deep learning success story reported that two GPUs were used over a week-long training period. At the current AWS rate this would cost about $220 for that single run.
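
As a rough back-of-the-envelope check (assuming roughly $0.65 per hour per GPU instance, which is about what AWS charges as I write this; verify against current pricing):

    # Back-of-the-envelope estimate of renting GPU time for one training run.
    # The hourly rate is an assumption; check current AWS GPU instance pricing.
    hourly_rate = 0.65      # dollars per GPU instance per hour (assumed)
    gpus = 2                # the success story used two GPUs
    hours = 7 * 24          # one week of continuous training

    cost = hourly_rate * gpus * hours
    print("Estimated cost: $%.0f" % cost)   # roughly $220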

Which GPU Vendor?

Nvidia. That’s it. Nvidia has been very active in promoting their CUDA programming model. There are online courses available for CUDA and a variety of development tools. Local development will be the way to go for creating new code. The OpenCL programming model is an alternative for GPU use, but it does not have the fine-grained device control necessary to take maximum advantage of the hardware. Current versions of Theano, Torch, and other machine learning libraries work with CUDA, not OpenCL.
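
If you want to confirm that a library like Theano can actually see your card, a minimal sanity check looks something like this sketch (adapted from the GPU test in the Theano documentation; it assumes Theano and CUDA are already installed):

    # Run as: THEANO_FLAGS=device=gpu,floatX=float32 python check_gpu.py
    import numpy as np
    import theano
    import theano.tensor as T

    x = theano.shared(np.random.rand(2000, 2000).astype('float32'))
    f = theano.function([], T.dot(x, x))    # a big matrix multiply is GPU-friendly
    f()

    print(theano.config.device)             # 'gpu' (or 'gpuN') if CUDA is in use
    print(f.maker.fgraph.toposort())        # Gpu* ops appear here when the GPU is used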

I have not been motivated to understand why, but it seems that bitcoin miners prefer the AMD/ATI GPU family over Nvidia for their code. But that’s bitcoin mining, not deep learning.

Which Card?

How much do you want to spend? At the lower end $100 will buy a GTX 750. This card has 512 compute cores but is only available with 1GB of memory. This seems like it will run out of space on machine learning tasks very quickly. In the $130 – $150 range the 640 core, 2GB, GTX 750 Ti looks like a much better choice. Both of these cards are based on Nvidia’s latest “Maxwell” processors and have much lower power requirements than earlier cards.
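
To put the 1GB limit in perspective, here is a rough, hedged estimate using an ImageNet-scale network with around 60 million parameters as a reference point (the exact numbers depend entirely on your model and minibatch size):

    # Rough memory estimate for just the parameters of a ~60M-parameter network
    # trained with momentum: weights + gradients + velocity = 3 float32 copies.
    # Activations, workspace buffers, and the minibatch itself come on top of this.
    params = 60e6               # ~60 million parameters (assumed, ImageNet-scale)
    bytes_per_float = 4         # float32
    copies = 3                  # weights, gradients, momentum

    mem_mb = params * bytes_per_float * copies / 2**20
    print("Parameters alone: ~%.0f MB" % mem_mb)    # ~687 MB, most of a 1GB card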

A step up to the $380 – $400 range will buy a 1536 core, 4GB, GTX 770. This is the same level of GPU muscle available through AWS cloud instances. At the top of the range you can be really crazy and get the Titan-Z with 5760 cores and 12GB of memory for $3000.

If the purchase can wait until late 2014, the Maxwell architecture is supposed to be moving into the mid- and high-end cards. So: cooler, quieter, and less need for a high-wattage power supply.

Maybe an older card to save a few bucks? Maybe, but the “compute capability” evolves as the generations of GPUs move forward. Check the charts on the CUDA Wikipedia page to see what compute capability goes with a specific GPU. I recently saw a low-end GT 640 for a good price. That seemed like a nice starter for some at-home CUDA coding. However, further investigation revealed that the card I was considering had slower GDDR3 memory and a lower compute capability than other GT 640s with GDDR5. Not such a good deal after all.
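
Once a card is installed you can also verify its compute capability and memory directly rather than trusting the spec sheet. A sketch, assuming PyCUDA (pip install pycuda) and a working CUDA driver:

    # List each CUDA device's name, compute capability, and memory.
    import pycuda.driver as cuda

    cuda.init()
    for i in range(cuda.Device.count()):
        dev = cuda.Device(i)
        major, minor = dev.compute_capability()
        mem_gb = dev.total_memory() / float(1024 ** 3)
        print("%s: compute capability %d.%d, %.1f GB" %
              (dev.name(), major, minor, mem_gb))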

Where Will You Plug It In?

Now the decision gets complicated. First, understand that moving data between the GPU and the computer’s main memory is going to happen a LOT. This, plus coordination overhead, will take up most of the time spent on the “GPU computing” portion of processing. Even though there may be 1500+ computing cores available on a GPU, they cannot all be kept busy all the time. You will actually be doing quite well if the GPU-aware parts of your problem run 50x to 100x faster than a CPU-only version of the code. It helps to make GPU/CPU data transfers less frequent by having enough GPU memory to keep data on-board, avoiding repeated back-and-forth shuffling. It is also important that the transfers that do occur are as fast as possible.
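
One common way to cut down on transfers, in Theano at least, is to copy the whole training set into a shared variable once so it lives in GPU memory, then slice minibatches out of it on the device instead of feeding each batch from host RAM. A sketch (the array sizes and the trivial “training step” are placeholders):

    # Keep the training data resident on the GPU as a shared variable and
    # slice minibatches on the device, avoiding a host->GPU copy per batch.
    import numpy as np
    import theano
    import theano.tensor as T

    data = np.random.rand(50000, 784).astype('float32')   # placeholder dataset
    train_x = theano.shared(data, borrow=True)             # one host->GPU transfer

    index = T.lscalar('index')
    batch_size = 128
    batch = train_x[index * batch_size:(index + 1) * batch_size]

    # Stand-in for a real training step: just reduce the minibatch.
    step = theano.function([index], batch.sum())
    print(step(0))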

Most of these cards are designed to go in a PCIe 3.0 motherboard slot. Will they work in a PCIe 2.0 slot? They should, but data transfer rates are roughly 2x faster in 3.0 slots. The article Theoretical vs. Actual Bandwidth: PCI Express and Thunderbolt goes a lot deeper into the 2.0 vs 3.0 differences and “lanes” (the Thunderbolt material does not apply here). But its performance conclusions are about game image rendering, not machine learning computation.
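
For a rough sense of that gap, the theoretical per-lane numbers work out as follows (real transfer rates land well below these peaks):

    # Theoretical PCIe bandwidth: 2.0 runs at 5 GT/s with 8b/10b encoding,
    # 3.0 at 8 GT/s with 128b/130b encoding.
    gen2_lane = 5.0 * (8 / 10.0) / 8       # GB/s per lane -> 0.5
    gen3_lane = 8.0 * (128 / 130.0) / 8    # GB/s per lane -> ~0.985

    for name, lane in [("PCIe 2.0", gen2_lane), ("PCIe 3.0", gen3_lane)]:
        print("%s: %.2f GB/s per lane, %.1f GB/s at x16" % (name, lane, lane * 16))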

Oh, and where did you plan to plug in your monitor? I’m not sure what fraction of a GPU would be consumed by the OS user interface, but it can’t be good for running machine learning tasks at full speed. So what about that UI, then? Two choices, both requiring a relatively modern motherboard. If your CPU can handle graphics directly (recent Intel chips and the AMD APU series), problem solved. Or, if you have another dedicated GPU for graphics, you will need an additional PCIe slot for it. Just watch out for the “lane restrictions” with multiple PCIe devices. For instance, some mobos will give x16 lanes to a single GPU, but plug in a second GPU and they each get x8. One possibility (I have not actually tested this) would be to put the shiny new GPU for machine learning tasks in a PCIe 3.0 slot and a less capable GPU dedicated to the user interface in a PCIe 2.0 slot.
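
If you do end up with two cards, one way to keep the compute work off the display GPU (again untested by me, and the device numbering here is an assumption; check it with nvidia-smi or the PyCUDA listing above) is to restrict which devices a CUDA program can see before it starts:

    # Expose only the compute card to CUDA libraries. The environment variable
    # must be set before any CUDA code initializes; "0" is assumed to be the
    # machine learning card on this machine.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import pycuda.driver as cuda
    cuda.init()
    print(cuda.Device.count())      # should now report a single visible device
    print(cuda.Device(0).name())

Theano users can get a similar effect with device=gpuN in THEANO_FLAGS.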

Need to play Call of Duty, Watch Dogs, or GTA using your hottest GPU? Better get used to reconfiguring displays and swapping cables.