Too cool, must learn more about this: Daniel Nouri’s Using deep learning to listen for whales uses multiple convolution layers to process inputs to a neural net. Details? Dunno much. In the whale case the audio is treated as a greyscale image. Are the convolution layers an independent feature-generation stage in front of the neural net, or are all the weights learned together? Since convolution-layer weights are shared among multiple neurons, it seems a deep “normal” net could achieve similar results, but its training time might be measured in archaeological time units.
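To make the weight-sharing point concrete, here is a minimal numpy sketch (not Nouri’s actual code) of a single convolutional filter slid over an audio clip rendered as a greyscale image. The image contents and the 3×3 kernel are made-up stand-ins; the point is that the same handful of weights is reused at every position, which is exactly what a fully connected “normal” layer would have to learn separately for each location.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared-weight kernel over a 2D greyscale image (valid padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The SAME 9 weights are applied at every (i, j) position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Stand-in for a 64x64 greyscale rendering of a whale call.
spectrogram = np.random.rand(64, 64)
kernel = np.random.randn(3, 3)  # only 9 learned parameters, shared everywhere

feature_map = conv2d(spectrogram, kernel)
print(feature_map.shape)  # (62, 62)

# Contrast: a fully connected layer mapping the 64x64 input to a 62x62
# output would need 64*64*62*62 independent weights instead of 9, which is
# where the archaeological training times would come from.
```

A real network would stack several such filters (and learn the kernels by backpropagation along with the rest of the net), but the parameter-count contrast is the same.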
Good stuff in the references list at the end of the post. Especially see the Nature article Computer science: The learning machines for an overview. Related: How Google Cracked House Number Identification in Street View and, of course, Kaggle’s The Marinexplore and Cornell University Whale Detection Challenge.