On feature scaling/data sparseness (in a specific spectral context)

a.harker · January 24, 2021, 1:24pm

I am doing some work with the MLP classifier and my input is a normalised spectral amplitude frame. This has potentially been derived with various sanitisation procedures/condensing etc. and so although it represents a spectral amplitude it may be derived from the original magnitude data through a number of stages. However, the important aspect of the data is the detail of the resultant spectrum.

Two questions:

1 - if I clean the data by zeroing some bins (which I consider not of use) then my fitting diverges massively - does anyone have a possible explanation?

2 - I am currently doing no feature scaling on the data - intuitively I consider the relative scale of the features (the bin or frequency magnitudes) to be important, and that this shouldn’t be altered. It so happens that my range will be 0-1. [edit to add an actual question - does this seem sensible?]

[my activation function is currently a sigmoid, should it be relevant]

Any insights gratefully received - I’m currently working on refining the sensitivity of my matching, as I have OK results, but I’m not quite there yet - I’m at a stage where many things could change (even the use of the MLP) but I want to avoid obvious areas in which I might be missing something.

weefuzzy · January 24, 2021, 1:49pm

Just a quick passing thought: does the training behaviour change at all if you use tanh activations instead? With sigmoid activations setting stuff to 0 is saturating that input, so rather than a don’t-care it would seem to indicate a we-care-a-lot.

a.harker · January 24, 2021, 1:59pm

If I use tanh activations I get the same result. I have an alternative method where I set them to the threshold rather than zero, which seems to improve (rather than wreck) things, but I’m still trying to understand a bit better what is going on.

weefuzzy · January 24, 2021, 2:17pm

Could be that the learning rate is now too high? [hand-waving] Maybe zeroing inputs makes for a lumpier error surface, so it’s more prone in training to jump between lots of different local minima…

a.harker · January 24, 2021, 2:18pm

Actually, I’ve just realised I have two sets of zero-ing happening here, and it’s only one of them (probably affecting more bins) that causes the issue.

weefuzzy · January 24, 2021, 2:46pm

I’d still suggest experimenting with the learning rate as a first recourse if it’s diverging during training.

a.harker · January 24, 2021, 5:26pm

OK - so I want to set it lower - is that right?

jamesbradbury · January 24, 2021, 5:54pm

If you get wild results then lowering it should help to smooth out the change between epochs and to make sure you don’t miss local minima. I find that when the learning rate is too high (for processes that have a learning rate) it is akin to loading a spring with too much force (it may fly off in any direction).
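A minimal sketch of that intuition, assuming plain gradient descent on the toy function f(x) = x**2 rather than a real network (the function and the two rates are illustrative choices only):

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(x) = x**2 (gradient 2 * x) and return every visited point."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
        path.append(x)
    return path


# With a rate of 0.1 each step multiplies x by 0.8, so it shrinks towards 0.
small = gradient_descent(0.1)

# With a rate of 1.1 each step multiplies x by -1.2, so it overshoots the
# minimum and lands further away on the other side every time.
large = gradient_descent(1.1)

print(abs(small[-1]), abs(large[-1]))
```

On this toy surface the overshoot is regular; on a real error surface the same effect tends to look like erratic jumps in the reported error.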

tremblap · January 24, 2021, 5:54pm

Now I’m sure @weefuzzy will have wiser approaches… My trick, which is dirty, is to set it at 0.1 and look at the result after only 1000 iterations. If it bounces, i.e. if the returned error is going up and down, I divide by 10 and continue until it bounces again, then divide by 10 again. That way I progressively reach some point where it stops.

@groma told me that was crazy and restless, and that he uses smaller values and more iterations.
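A rough sketch of the divide-by-10 trick, under some loud assumptions: `toy_training` is a hypothetical stand-in for a real training run (gradient descent on f(x) = 15 * x**2), "bounces" is taken to mean "the error went up at least once", and each attempt restarts from the same starting point rather than continuing from the current weights.

```python
def bounces(errors):
    """True if the error rises between any two consecutive readings."""
    return any(later > earlier for earlier, later in zip(errors, errors[1:]))


def toy_training(learning_rate, iterations=10, start=1.0):
    """Hypothetical stand-in for a training run.

    Runs gradient descent on f(x) = 15 * x**2 and returns the error after
    every iteration.
    """
    x = start
    errors = []
    for _ in range(iterations):
        x -= learning_rate * 30.0 * x  # gradient of 15 * x**2 is 30 * x
        errors.append(15.0 * x * x)
    return errors


def find_learning_rate(run_training, start=0.1, floor=1e-6):
    """Divide the rate by 10 until a run no longer bounces."""
    rate = start
    tried = []
    while rate >= floor:
        tried.append(rate)
        if not bounces(run_training(rate)):
            return rate, tried
        rate /= 10.0
    return None, tried  # nothing above the floor was stable


rate, tried = find_learning_rate(toy_training)
print(rate, tried)
```

This only finds the order of magnitude at which the toy run stops bouncing; any fine-tuning from there is left out.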

weefuzzy · January 24, 2021, 7:56pm

Finding a decent learning rate can be frustrating. Values as small as 1e-5 or 1e-6 aren’t uncommon; perhaps even smaller with more complex tasks. I might typically start by finding the order of magnitude that seems to converge, and then tweak a bit from there.

a.harker · January 24, 2021, 8:31pm

Thanks all - after no success with learning rates, I did some more investigating. I think what was happening here was probably that some of my data was getting totally zeroed, but still passing the energy test that happens prior to the more detailed spectral analysis. Thus I likely had cases linked to all labels in which the data was all zeros (alongside actual data).

It turns out that this is not helpful.

The good news is that if I do my energy test after the thinning of the data I get better accuracy (within my moderately shoddily defined testing procedures) - so it seems like I might be on a reasonable track towards a good level of accuracy. I will continue to sanitise and at some point I’ll need to figure out how it will fare with much bigger training sets (right now I have 6 classes and about 3000 training examples).
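A minimal sketch of that reordering, using plain Python lists in place of spectral frames. The names `thin`, `energy` and `build_training_set`, the sum-of-squares energy measure and the threshold value are all assumptions for illustration, not my actual code:

```python
def thin(frame, keep):
    """Zero every bin whose entry in `keep` is False."""
    return [value if wanted else 0.0 for value, wanted in zip(frame, keep)]


def energy(frame):
    """Sum of squared bin amplitudes (one possible energy measure)."""
    return sum(value * value for value in frame)


def build_training_set(frames, labels, keep, min_energy):
    """Thin each frame first, then apply the energy test to the thinned frame."""
    examples = []
    for frame, label in zip(frames, labels):
        thinned = thin(frame, keep)
        if energy(thinned) >= min_energy:
            examples.append((thinned, label))
    return examples


frames = [
    [0.9, 0.0, 0.0, 0.1],  # all energy sits in bins that get discarded
    [0.0, 0.8, 0.6, 0.0],  # energy sits in bins that are kept
]
labels = ["a", "b"]
keep = [False, True, True, False]

examples = build_training_set(frames, labels, keep, min_energy=0.01)
print(examples)
```

Testing the energy before thinning would have let the first frame through as an all-zero example labelled "a", which is the failure described above.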

The learning rate ideas may also be a route to getting better results. At the moment most of the parameters to the MLP feel pretty mysterious, so I will make it a point to investigate these more seriously in order to further massage the numbers in the direction of what is in my case “good”.

a.harker · January 24, 2021, 8:57pm

So - in relation to learning the parameters of this object - where should I look? The help file and reference in max don’t explain the terms - just name them - obviously I can google, but is there an advised source of info on this? The forum doesn’t appear to have a comprehensive set of answers (if I search momentum I just get results from code, for instance).

weefuzzy · January 25, 2021, 12:24am

[Yes, the help files are behind the curve a bit. Bear with us, etc.]
Presumably something going into more detail than the rambling video I did in the summer? With the obvious disclaimer that @groma is the geezer who really knows this stuff, some more in depth pointers:

This paper by Yoshua Bengio, Practical recommendations for gradient-based training of deep architectures, is a chapter from the eye-wateringly dear book Neural Networks: Tricks of the Trade. Besides other things, it describes most (all?) of the adjustable knobs you’ll find on our mlp objects, and some indication of how to approach them.

This paper by Leslie Smith, Cyclical Learning Rates for Training Neural Networks, whilst actually about a scheme for programmatically optimising learning rates during training, generalises to some pragmatic advice, I think (which boils down to it being essential to establish a workable range, whether or not you’re using an automated schedule).

The momentum parameter is also quite important in squeezing training performance out of a network. This article, Why Momentum Really Works by Gabriel Goh, dives into that, and because it’s on distill.pub, there are nice widgets to play with.

weefuzzy · January 25, 2021, 12:36am

This is another chapter from the Tricks of the Trade book, Stochastic Gradient Descent Tricks by Léon Bottou, which intersperses some technical description with pretty clear advice in bold print in boxes (so, good for the skim reader in your life…)

tremblap · January 25, 2021, 8:41am

There are also examples in the example folder which should help intuit some of them… and there are also the links curated by @weefuzzy and @groma in the other threads on MLP, from towardsdatascience. Their explanations are clear enough to get one going…

tremblap · January 25, 2021, 9:40am

This is really fun to play with, and so clear. I presume @groma was right once again and I’ll have to be more patient, lower my learning rate and raise my momentum.

a.harker · January 29, 2021, 6:22pm

OK - the main improvements I seem able to make are in terms of the input data - I am now up near 90% accuracy even for stuff that is probably not perfectly represented in my training sets, so this is all quite promising.

In fact my learning rate went higher to do this - I can’t seem to get much effect out of momentum in my particular case.

I still don’t understand what changes when I set validation to something other than zero, however… I get the concept fairly generally, but I don’t understand what will change in the output of this object, or how to make use of this.

tremblap · January 29, 2021, 9:05pm

@groma will know for sure, but validation would be the ratio of the dataset you put in for training that is held back to check your training (instead of checking on the training data). The Google playground (https://playground.tensorflow.org/) has an equivalent IIUC, which is the ratio (first argument).

weefuzzy · January 29, 2021, 9:06pm

The idea is that validation will enable early stopping if the performance of the network against some reserved test data doesn’t improve over a number of iterations. So it can help stop overfitting.
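As a generic sketch of that idea (not a description of how the object itself implements it): hold back a fraction of the examples, check the error on them after each round of training, and stop once it has failed to improve for a set number of checks. The names `split_for_validation`, `train_with_early_stopping` and `patience` are hypothetical, and the validation errors here are scripted rather than coming from a real network.

```python
import random


def split_for_validation(examples, validation, seed=0):
    """Shuffle and hold back a fraction `validation` of the examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    held_back = round(len(shuffled) * validation)
    return shuffled[held_back:], shuffled[:held_back]


def train_with_early_stopping(train_step, validation_error,
                              max_iterations, patience):
    """Stop once the validation error has not improved for `patience` checks."""
    best = float("inf")
    checks_without_improvement = 0
    for iteration in range(1, max_iterations + 1):
        train_step()
        error = validation_error()
        if error < best:
            best = error
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return iteration, best
    return max_iterations, best


training, held_back = split_for_validation(range(10), validation=0.2)

# Scripted validation errors standing in for a real network: they improve,
# then drift upwards as the (imaginary) network starts to overfit.
scripted = iter([0.9, 0.7, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9])
stopped_at, best_error = train_with_early_stopping(
    train_step=lambda: None,              # placeholder for one training pass
    validation_error=lambda: next(scripted),
    max_iterations=8,
    patience=3,
)
print(len(training), len(held_back), stopped_at, best_error)
```

The held-back examples are never used to update the weights, which is why a rising error on them is a usable sign of overfitting.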

a.harker · January 29, 2021, 11:18pm

Thanks @weefuzzy - as a concept that’s great, but I guess what I’m not understanding is how that works in practice. I’ve seen some patches that have a feedback mechanism for epochs on the forum (is that what I’m supposed to do here?), but from your answer I don’t know:

- if the early stopping is my responsibility or that of the object
- when the validation data is used and what for
- how I access any values related to validation and use them

These are less conceptual questions than object interface questions, and I don’t know how I can learn the answers (they aren’t in the help or reference, and I don’t see an obvious example in the examples, but I might be skimming them too fast). Likewise the object has three outlets and I don’t know what they do or how to find out… If I’ve missed something, apologies - happy to be pointed to whatever is helpful.