@activation vs @outputactivation in fluid.mlpregressor~

So for pretty much the entire time I’ve used fluid.mlpregressor~, I’ve tested/varied @activation based on what I’m sending in (typically tanh for [-1, 1] and relu for [0, 1]), but I’ve always left @outputactivation at its default linear output (honestly, I didn’t even know there was a second activation attribute until a buddy mentioned it the other day).

The reference isn’t especially helpful in clarifying things. This is the text for @activation:

An integer indicating which activation function each neuron in the hidden layer(s) will use.

And this is the text for @outputactivation:

An integer indicating which activation function each neuron in the output layer will use. Options are the same as activation.

So is this just a matter of selecting something that meshes well with whatever kind of inverse transform (or whatever) sits on the output of the regressor (e.g. tanh on the output if you used standardize/robustscale, or relu if you used normalize), or is there something more at play here?
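To make the range question concrete, here’s roughly what I mean in plain numpy (just a sketch of the ranges involved, not anything FluCoMa-specific):

```python
# Rough sketch: why the output activation's range would need to cover the
# range of the (scaled) target data.
import numpy as np

targets_standardized = np.array([-2.3, -0.4, 0.0, 1.7, 3.1])  # standardize: unbounded, mean ~0, std ~1
targets_normalized = np.array([0.0, 0.25, 0.5, 0.75, 1.0])     # normalize: squeezed into [0, 1]

# tanh output lives in (-1, 1), so standardized targets beyond +/-1 are unreachable;
# sigmoid lives in (0, 1), which covers normalized targets but not standardized ones;
# a linear (identity) output can hit anything.
print(np.tanh(10.0))  # saturates just below 1.0 no matter how large the input gets
```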

I guess the actual @activation is tied into the gradient descent and will have an impact on how the network trains, functions, and has its loss computed, but @outputactivation only applies to a single layer. Is it also included in the loss computation and/or gradient descent, etc.?

In short:
How should one use @outputactivation vs @activation?


I may be out of my depth here, but my guess is that the output layer must be part of the computation of the network’s error against the target data. I understand it as being a fully connected layer of neurons, not simply a transfer function in series with the output of the network.
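If that’s right, it would look something like this (a plain-numpy sketch of a generic MLP, not the actual FluCoMa internals): the output activation is applied after the final weights and bias, so whatever it does is inside the prediction the loss is computed on, and therefore inside the gradients.

```python
# Minimal sketch: the output layer is a full weights + bias layer with its own
# activation, so it sits inside the loss and hence inside the gradient computation.
import numpy as np

def forward(x, W_hidden, b_hidden, W_out, b_out, out_act):
    h = np.tanh(x @ W_hidden + b_hidden)   # hidden layer(s), i.e. @activation (tanh here)
    y = out_act(h @ W_out + b_out)         # output layer, i.e. @outputactivation applied here
    return y

def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)  # the loss "sees" the output activation

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
y_true = rng.normal(size=(4, 2))
W_h, b_h = rng.normal(size=(3, 8)), np.zeros(8)
W_o, b_o = rng.normal(size=(8, 2)), np.zeros(2)

identity = lambda z: z                      # a "linear" output activation
print(mse_loss(forward(x, W_h, b_h, W_o, b_o, identity), y_true))
```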


That makes sense; I’m just not sure what the benefit/difference is in changing one activation type and not the other. So I feel there’s either a use case I’m not understanding, or a computational impact I’m missing.

By and large you just don’t need to be interested in the output activation unless you’re getting arcane. It’s pretty common practice to use a linear output layer for regression and a sigmoidal output layer for classifiers (the latter because it’s easier to squish each neuron into a binary-ish output).
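A quick sketch of that convention (hypothetical numbers, not from any particular trained network): a linear output leaves values unbounded, which suits regression targets, while a sigmoid squashes each output neuron toward 0 or 1, which suits one-hot-style classification targets.

```python
# Illustration of linear vs sigmoid output activations on the same raw sums.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

pre_activations = np.array([-4.2, -0.3, 0.8, 5.1])   # raw output-layer sums (made up)

print(pre_activations)           # linear output: values usable directly as regression estimates
print(sigmoid(pre_activations))  # sigmoid output: ~[0.01, 0.43, 0.69, 0.99], near-binary at the extremes
```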

Interesting. As in, independently of whatever activation is being used for the main @activation? (Or is this general best practice for both activation types?)

Yup, independently.
