On feature scaling/data sparseness (in a specific spectral context)

[Yes, the help files are behind the curve a bit. Bear with us, etc.]
Presumably you're after something going into more detail than the rambling video I did in the summer? With the obvious disclaimer that @groma is the geezer who really knows this stuff, here are some more in-depth pointers:

This paper by Yoshua Bengio, Practical recommendations for gradient-based training of deep architectures, is a chapter from the eye-wateringly dear book Neural Networks: Tricks of the Trade. Among other things, it describes most (all?) of the adjustable knobs you’ll find on our mlp objects, and gives some indication of how to approach them.
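For orientation, here's roughly the same set of knobs spelled out through scikit-learn's MLPRegressor. This is not our implementation, and the parameter names and values here are mine, but the concepts map over fairly directly:

```python
# A rough analogue of the mlp objects' knobs, via scikit-learn's MLPRegressor.
# Not the FluCoMa code; names and values are illustrative only.
from sklearn.neural_network import MLPRegressor

model = MLPRegressor(
    hidden_layer_sizes=(64, 64),  # network topology
    activation="relu",            # hidden-layer activation
    solver="sgd",                 # plain SGD, so momentum applies
    learning_rate_init=0.01,      # step size: the big one to get right
    momentum=0.9,                 # see the Goh article below
    batch_size=32,                # points per gradient step
    max_iter=1000,                # training epochs
    validation_fraction=0.1,      # held-out data for monitoring
    early_stopping=True,          # stop when validation error plateaus
)
```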

This paper by Leslie Smith, Cyclical Learning Rates for Training Neural Networks, whilst actually about a scheme for programmatically optimising learning rates during training, generalises to some pragmatic advice, I think (which boils down to it being essential to establish a workable range, whether or not you’re using an automated schedule).
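If it helps, here's a crude sketch (mine, not from the paper) of that "establish a workable range" idea: sweep a handful of learning rates on some throwaway synthetic data and watch where the loss actually moves, versus where it stalls or blows up:

```python
# Hedged sketch of a learning-rate range test on synthetic data.
# The rates, network size, and dataset are all illustrative only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 8))
y = np.sin(X).sum(axis=1)  # an arbitrary smooth target

for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    model = MLPRegressor(hidden_layer_sizes=(32,), solver="sgd",
                         learning_rate_init=lr,
                         max_iter=200, random_state=0)
    model.fit(X, y)
    # loss_curve_ holds the training loss per epoch; if the last value
    # hasn't moved (rate too small) or has exploded / gone NaN (rate too
    # big), that rate sits outside the usable range.
    print(f"lr={lr:g}  final training loss={model.loss_curve_[-1]:.4f}")
```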

The momentum parameter is also quite important in squeezing training performance out of a network. This article, Why Momentum Really Works by Gabriel Goh, dives into that, and because it’s on distill.pub, there are nice widgets to play with.
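And just to make the update rule concrete, here's a tiny toy version of the classical momentum step that article analyses, run on a 1-D quadratic (my own numbers, purely illustrative):

```python
# Minimal sketch of the classical momentum update:
#   v <- beta * v + grad(w)
#   w <- w - lr * v
# Applied to f(w) = 0.5 * w**2 (so grad = w), just to show that for the
# same small learning rate, momentum closes in on the minimum much faster.
def descend(lr=0.01, beta=0.9, steps=100, w0=5.0):
    w, v = w0, 0.0
    for _ in range(steps):
        grad = w            # gradient of 0.5 * w**2
        v = beta * v + grad
        w = w - lr * v
    return abs(w)           # distance from the minimum at w = 0

print("no momentum :", descend(beta=0.0))   # crawls towards 0
print("momentum 0.9:", descend(beta=0.9))   # gets there much faster
```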