So you want to run some deep learning code and heard you need a GPU? This seems correct. For deep learning problems of sufficient size to be interesting, the alternatives are to either (A) use a supercomputer or (B) have your body cryogenically preserved and arrange to be reanimated when your program completes. The second option makes iterative development cumbersome.
You could also rent time on a GPU-equipped cloud instance. But I believe this only makes sense when you have both already-debugged code and the right economics. A recent deep learning success story reported that two GPUs were used over a week-long training period. At current AWS rates that would cost about $220 for that single run.
Which GPU Vendor?
Nvidia. That’s it. Nvidia has been very active in promoting their CUDA programming model. There are online courses available for CUDA and a variety of development tools, and local development will be the way to go for creating new code. The OpenCL programming model is an alternative for GPU computing, but it does not offer the fine-grained device control necessary to take maximum advantage of the hardware. Current versions of Theano, Torch, and other machine learning libraries work with CUDA, not OpenCL.
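For a sense of what the CUDA programming model looks like in practice, here is a minimal vector-add sketch, the usual “hello world” of GPU code. Everything in it is standard CUDA runtime API; the kernel and sizes are just illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                  // one million floats
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // Copy inputs to the card, launch enough 256-thread blocks to cover n,
    // then copy the result back to host memory.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);          // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compile with nvcc and run. The point is simply that you write a small kernel that executes per-thread on the card, while the host code handles memory and kernel launches.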
I have not been motivated to understand why, but it seems that bitcoin miners prefer the AMD/ATI GPU family over Nvidia for their code. But that’s bitcoin mining, not deep learning.
Which Card?
How much do you want to spend? At the lower end, $100 will buy a GTX 750. This card has 512 compute cores but is only available with 1GB of memory, which seems like it will run out of space on machine learning tasks very quickly. In the $130–$150 range, the 640-core, 2GB GTX 750 Ti looks like a much better choice. Both of these cards are based on Nvidia’s latest “Maxwell” processors and have much lower power requirements than earlier cards.
A step up to the $380–$400 range will buy a 1536-core, 4GB GTX 770. This is the same level of GPU muscle available through AWS cloud instances. At the top of the range you can be really crazy and get the Titan-Z with 5760 cores and 12GB of memory for $3000.
If the purchase can wait until late 2014, the Maxwell architecture is supposed to be moving into the mid- and high-end cards, which should mean cooler, quieter cards and less need for a high-wattage power supply.
Maybe an older card to save a few bucks? Maybe, but the “compute capability” evolves as the generations of GPUs move forward. Check the charts on the CUDA Wikipedia page to see what compute capability goes with a specific GPU. I recently saw a low-end GT 640 for a good price, which seemed like a nice starter for some at-home CUDA coding. However, further investigation revealed that the card I was considering had slower GDDR3 memory and a lower compute capability than other GT 640s with GDDR5. Not such a good deal after all.
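If you already have an Nvidia card on hand (or want to verify a purchase), the CUDA runtime will tell you exactly what you bought. Below is a small query sketch using cudaGetDeviceProperties; the compute capability and memory clock fields are what would have exposed the GDDR3 vs. GDDR5 difference on that GT 640.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  Memory clock:       %d MHz\n", prop.memoryClockRate / 1000);
        printf("  Memory bus width:   %d bits\n", prop.memoryBusWidth);
    }
    return 0;
}
```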
Where Will You Plug It In?
Now the decision gets complicated. First, understand that data will be moved between the GPU and the computer’s main memory a LOT. This, plus coordination overhead, will take up most of the time spent on the “GPU computing” portion of processing. Even though there may be 1500+ compute cores available on a GPU, they cannot all be kept busy all the time. You will actually be doing quite well if the GPU-aware parts of your problem run 50x to 100x faster than a CPU-only version of the code. It helps to have enough GPU memory to keep data on-board, so GPU/CPU transfers occur less frequently and the data is not repeatedly shuffled back and forth. It is also important that the transfers that do occur are as fast as possible.
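Here is a minimal sketch of that idea in CUDA: copy the data to the card once, run many iterations of a hypothetical trainStep kernel against it in device memory, and copy results back only at the end. The pinned (page-locked) host allocation via cudaMallocHost generally makes the transfers that remain faster than ordinary pageable memory would allow.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder standing in for one iteration of real work.
__global__ void trainStep(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.999f;
}

int main() {
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host memory: faster host<->device copies.
    float *h_data;
    cudaMallocHost((void **)&h_data, bytes);

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);

    // One copy up, many iterations on-card, one copy back --
    // instead of shuffling the data across PCIe every iteration.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    for (int iter = 0; iter < 1000; ++iter)
        trainStep<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```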
Most of these cards are designed to go in a PCIe 3.0 motherboard slot. Will they work in PCIe 2.0 slots? They should, but data transfer rates are roughly 2x faster for the 3.0 slots. The article Theoretical vs. Actual Bandwidth: PCI Express and Thunderbolt goes a lot deeper into the 2.0 vs. 3.0 differences and “lanes” (the Thunderbolt material does not apply here). But its performance conclusions are about game image rendering, not machine learning computation.
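Rather than trusting spec sheets, you can time a large host-to-device copy on your own machine. This sketch uses CUDA events to report an effective transfer rate; the 256MB buffer size is arbitrary, and the numbers will vary with slot generation, lane count, and pinned vs. pageable host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (size_t)256 << 20;  // 256 MB test buffer

    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);  // pinned memory for a best-case number
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time one large host-to-device copy.
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device: %.2f GB/s\n",
           (bytes / (1024.0 * 1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf); cudaFreeHost(h_buf);
    return 0;
}
```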
Oh, and where did you plan to plug in your monitor? I’m not sure what fraction of a GPU would be consumed by the OS user interface, but it can’t be good for running machine learning tasks at full speed. So what about that UI then? There are two choices, and both require a relatively modern motherboard. If your CPU can handle graphics directly (recent Intel chips and the AMD APU series), problem solved. Otherwise, if you have another dedicated GPU for graphics, you will need an additional PCIe slot for it. Just watch out for “lane restrictions” with multiple PCIe devices. For instance, some motherboards will give a single GPU an x16 connection, but plug in a second GPU and they each get x8 lanes. One possibility (which I have not actually tested) would be to put the shiny new GPU for machine learning tasks in a PCIe 3.0 slot and a less capable GPU dedicated to the user interface in a PCIe 2.0 slot.
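On the software side, CUDA enumerates every Nvidia card in the box, so your code (or your library’s device setting) has to pick the right one. Here is a small sketch that selects the card with the most multiprocessors, on the assumption that the smaller card is the one driving the display.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0, best = 0, bestSMs = -1;
    cudaGetDeviceCount(&count);

    // Pick the card with the most multiprocessors, assuming the
    // beefier card is the one intended for compute work.
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (prop.multiProcessorCount > bestSMs) {
            bestSMs = prop.multiProcessorCount;
            best = d;
        }
    }

    cudaSetDevice(best);
    printf("Using device %d for compute work\n", best);
    return 0;
}
```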
Need to play Call of Duty, Watch Dogs, or GTA using your hottest GPU? Better get used to reconfiguring displays and swapping cables.