Regression: input/output lists (any range)

fra.dimaggio · October 28, 2022, 8:10pm

Hi,

I am experimenting with fluid.mlpregressor~. My wish is to feed the neural net with input/output lists of a higher “range” than 0-1 (e.g. -2000 to 2000). Looking at the documentation on Learn FluCoMa, it seems like setting the “activation function” to 0 (“identity”) should do the trick, but I haven’t been successful. Perhaps there is something I am overlooking? Thank you very much in advance. Francesco

weefuzzy · October 28, 2022, 10:03pm

Hi and welcome @fra.dimaggio

There’s nothing to stop you using ranges other than 0-1 with the regressor, nor should it matter (hugely) what activation function you use on the hidden layers: remember the inputs to the activations are weighted using weights learned during fitting, which can adjust to rescale. It possibly makes more of a difference what you use as an output activation function though (but, IIRC, linear is the default for this).

Can you say more about the particular problems you’re having? Is the network failing to fit?

fra.dimaggio · October 29, 2022, 9:51am

Hi @weefuzzy, and thank you for your quick reply.

Let me post the patch here; hopefully you can see where it goes wrong.
I took the simple regression example, and changed the input/output list range:

input is 0-127
output is -2000-2000

However, when trained, either I get the error message “fluid.mlpregressor~: No data fitted”, or prediction output ranging 0-1.

Is there any setting I need to adjust in order to get this behaviour right?

Thanks,
Francesco

test-mlpregressor~.maxpat (14.9 KB)

fra.dimaggio · October 29, 2022, 10:59am

Hi @weefuzzy,

From the Music Hackspace Workshop “Flucoma” (in the examples folder) I found this, which seems to have helped me better “understand” and manage the range issue: i.e. using fluid.normalize~!

I have quickly applied it to the regression example, and it seems to be working just fine.

What do you think? Is this a good way to handle the ranges “issue”, or perhaps you were proposing a different approach? Thanks.

Francesco

test-mlpregressor~3.maxpat (26.1 KB)

weefuzzy · October 29, 2022, 1:14pm

If normalizing or standardizing are practical / feasible for your purposes then, yes, I’d absolutely endorse doing this because it will (generally) help the fitting converge faster and (generally) require less tuning of parameters.

FWIW, I was able to get things to converge without normalizing, and it could be useful to explain my approach. tl;dr I turned the learning rate down to 0.0001, used a single hidden layer with 10 instead of 3 neurons, and left the output activation as linear.

In more detail,

getting a neural network to fit always requires a certain amount of trial and error
generally, the most important parameter to adjust in the first instance is the learning rate. The network shape matters less than one might think (though this is less true in cases like this with very small datasets).
when you call fit it returns with a number – the loss – that gives an indication of how wrong the network’s predictions are with respect to its output dataset
the worst case for this number depends on the squared ranges of the output dimensions, so will be much bigger when the output ranges are outside 0-1. In this case, the output is 10 dimensions each with a range of 4000, so 10 * 4000² = 1.6x10⁸.
the goal with training is that the loss gets smaller, tending towards (but never reaching) 0. It’s not worth fixating on absolute values for it beyond this because how good a fit you can get depends on the input and output data, their relationship, the network shape etc etc. However, having an idea of its potential range is useful
for larger output ranges, you will (generally) want a smaller learning rate, because the learning rate scales the loss when adjusting the network weights. If it’s too big, and you have large output ranges (and large potential loss values), the weights will swing all over the place and the network won’t converge.
however, smaller learning rates also imply slower convergence (i.e. more iterations to get somewhere useful). So, often the price for not normalizing / standardizing will be that it takes longer to fit the network
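
To make the learning-rate trade-off above a bit more concrete, here is a rough Python sketch. It is a toy stand-in, not FluCoMa code: a one-input linear model (not an MLP) fitted by plain gradient descent on data with the same ranges as the patch (input 0-127, output -2000 to 2000). The function name `fit_line` and the 1e30 "blown up" threshold are made up for illustration. The mechanics inside a real MLP differ, so take it only as a picture of the general pattern: a rate that is too high makes the loss grow, while a small one shrinks it, but slowly.

```python
def fit_line(xs, ys, learning_rate, iterations):
    """Gradient descent on mean squared error for pred = w * x + b.

    Returns the loss recorded at the start of each iteration. Stops early
    once the loss is absurdly large (an arbitrary stand-in for 'blown up').
    """
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        losses.append(loss)
        if loss > 1e30:  # hypothetical cut-off, only to avoid overflow
            break
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # The learning rate scales how far each update moves the weights.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return losses


# Un-normalized data: input 0-127 mapped linearly onto -2000..2000.
xs = [float(i) for i in range(128)]
ys = [-2000.0 + 4000.0 * x / 127.0 for x in xs]

too_high = fit_line(xs, ys, learning_rate=0.5, iterations=50)
small = fit_line(xs, ys, learning_rate=0.0001, iterations=50)

print("rate 0.5:    first loss", too_high[0], "last loss", too_high[-1])
print("rate 0.0001: first loss", small[0], "last loss", small[-1])
```

With the high rate the loss climbs until the loop gives up; with the small rate it goes down on every step, but only a little over 50 iterations.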

What I tend to do when adjusting the network is turn the @maxiter down to something quite small (like 10) and then hit calls to fit with a qmetro (say every 100ms), and then look at what the loss is doing for different settings. Either just with a message box, or with a multislider in history mode. The basic routine is to try and zero-in on the value where you get reliable convergence, i.e. the loss is decreasing with each call to fit. If the learning rate is too high then the network might just blow up (fit will return -1) or the loss will start going up. If it’s a little too high, then the loss might go down sometimes and up some other times. If it’s too low then it might go down, but very slowly, or just stay still.
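
The routine above boils down to reading a short run of loss values and deciding which way to nudge the learning rate. As a hedged sketch of that decision logic in Python (the function `describe_loss_trend`, its labels and the `stall_tolerance` default are all invented for illustration; the only thing taken from the description above is that a fit returning -1 means the network blew up):

```python
def describe_loss_trend(losses, stall_tolerance=1e-3):
    """Label a run of loss values collected from repeated short fits."""
    if any(loss == -1 for loss in losses):
        return "blown up"        # learning rate far too high
    if len(losses) < 2:
        return "not enough data"
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    ups = sum(1 for s in steps if s > 0)
    downs = sum(1 for s in steps if s < 0)
    if ups == 0 and downs == 0:
        return "stalled"         # loss is not moving at all
    if downs == 0:
        return "rising"          # learning rate too high
    if ups > 0:
        return "unstable"        # probably a little too high
    # Only downward steps from here on: check how much ground was covered.
    if losses[0] > 0 and (losses[0] - losses[-1]) / losses[0] < stall_tolerance:
        return "too slow"        # learning rate probably too low
    return "converging"


print(describe_loss_trend([100.0, 50.0, 25.0]))
print(describe_loss_trend([10.0, 8.0, 9.0, 7.0]))
print(describe_loss_trend([10.0, -1]))
```

In a patch the same judgement is made by eye from the message box or multislider; the thresholds here are arbitrary.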

Once you’ve found a ballpark, it’s worth adjusting with some trial values and then completely clearing the network and starting again a few times with that learning rate and seeing how reliable / consistent it is. This is how I ended up with 0.0001. With higher values, I was finding that convergence wasn’t reliable. Sometimes it would, sometimes it wouldn’t. Having settled on a useful-seeming value, I then left it running for a while to see how ‘good’ it could get. As it happened, with the example mappings I made, the loss went down from an initial 2x10⁶ or so to around 0.000001 after around 58000 iterations. So it got there, but it took a while.

By contrast: if I normalized the input and output data, I was able to use a learning rate of 0.5 and converged within 190 iterations. So much, much quicker!
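
For anyone wondering what normalizing amounts to numerically, here is a minimal Python sketch of min-max scaling and its inverse. It illustrates the general idea only and is not the interface of fluid.normalize~; the helper names (`fit_min_max`, `normalize`, `denormalize`) and the example rows are made up. The point to notice is the last step: a network trained on normalized outputs predicts in the 0-1 range, so predictions have to be mapped back to the original range.

```python
def fit_min_max(rows):
    """Find the per-dimension minimum and maximum of a list of rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(row, lows, highs):
    """Map each value into 0-1 using that dimension's min and max."""
    return [(v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, lows, highs)]


def denormalize(row, lows, highs):
    """Map 0-1 values (e.g. predictions) back to the original range."""
    return [lo + v * (hi - lo) for v, lo, hi in zip(row, lows, highs)]


# Made-up two-dimensional output data in roughly the -2000..2000 range.
outputs = [[-2000.0, 500.0], [0.0, -1500.0], [2000.0, 2000.0]]

lows, highs = fit_min_max(outputs)
scaled = [normalize(row, lows, highs) for row in outputs]
restored = [denormalize(row, lows, highs) for row in scaled]

print(scaled)
print(restored)
```

The same min/max values have to be kept around after training, because they are what turns a 0-1 prediction back into the original units.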

fra.dimaggio · October 29, 2022, 2:26pm

Excellent - Thanks a lot for this explanation!

Coming from a wekinator/rapidmax background, I wasn’t aware of the internal architecture of the neural network. Being used to feeding it with any range of input / output, I’d have expected the prediction output to keep the same range as the data used during training. But I seem to get it now.

Food for thought.

I am happy with the normalisation now, but I will try to replicate your approach to better understand the logic behind it - btw, in which circumstances would you choose one over the other?