FR: adding a 'fit stop' message to regressors

I know this is a bit late in the game for an FR, but this wouldn’t really break things and would make it easier to run big trainings on regressors without needing to pinwheel.

So the main issue is that I, like I imagine many people, do my regressor training in a defer’d loop like this:


----------begin_max5_patcher----------
1949.3oc4ZssjiZCD8YOeEp3YutzcgxS6+QRpov1Z7RBFbA3Y2IoR91SKI.i
Gi4hMimjJT6XVDRp0o6S2RsD+4SKBVm8CSQ.5mP+LZwh+7oEKbEYKXQ0yKB1
G8iMIQEtpErIa+dSZYvR+6JM+nzUdbI56YGS1hVaP6xMQknxLT9wTTQ1dS42
hS2gRh+cCB9uEnu+MStAAMYcTbRAzjWxfmyd0j+RbYIT2kn0GKQwonHzuDjj
kc3WBpkXRbpYS1wTmXYUEdH2T.ipnx3rzmOqF75ZDUtwNJdN2rozCXoFuBuD
wnT6MJ1cSJWgQ+ZUiROtONMwT5fNopv3sN.ms929BEWOpVeb85DSQ7Vi8kzS
sO6XYcGfOqtttzVve8zS1eVNRCvdSQQzNyEFfMIln7fqi1PkvgVszcy8fE4c
CV5kfknC5DV05EeQkucv3EXP.5WuA7cMBlm3Dmjf1kgxRQVJCPXVhfeSQwu.
LKCBTAIugJJyNbvx31jGWZxiiPPSAVX+TnKUYrPGAQPY1aDg+IxDHHDUvGHU
3ZpJ5JD3GgxMfiXQQVerBBV4I+RODcHkwZAwWxRKKh+COu1pAFIxEeFHmrBk
jEsEsMpLpvJwqibNygbt21JUyFv4AeBd+V6MLFwBBi1CnUXGb4ggNTSISOJ.
8gDEnOblDsq34zr78nCYEw1P99GGF2UA+HR5jwM6g.6Ty2AgcIpSNFuc09jC
Mtz+M5qeKd6VSZRzal7BXRL6+9Zzlx3WcyBhXnuBcsM9G3j6tFAuf43Ebrte
8SGzdY25G5U0OKmapwg73SQD5Z5dQkwuZV+ISAB+Lo.fo+fIcqK7VOfTHIs.
oZ51Q8mp68P1Pt9Nsgp+CXC436zFJdHfDVgdYV5v3f6WgAu929gwgn7HXoZl
7mMoQUSThuDh7IBw0Qo6lUa413MkMy+f9pY+ZyVDYDdlUqzhwmL28Jy7JtFl
sCQXvEk+lKXa6+ZjocgRUfuB8KfEWkzdHbdkpq0h.rWLNRJ9DLf2PpeCy9FE
6rqV0iVWu2UEaqXXQqKYqVwN0pvVW5knufWQCOqrVMi2LnHxVWJWyTTUqqvV
MS3alqqe2f7KWLDZ0NYS6dWe6Z20Qm5T6tTobczE1zr200NK.8rKuaPkifyU
vyEJhd0r8YfoAb1miJKyiAWb+tCzld3n5sb1f7oL6h17VKO0dcia+x8QGNz7
5mNMnlQeTX0hiw8rNnak64Pq.pqUF+fcO4ix8DXDXBQSzTViehMQAUHUq0b+
kuHA7nlz3V3pEjhf5Du48d3N5FWR3j553cBwXMQPwsnfXMkGp4s6cA3Lvohv
dhKvHJIWQ3MCyuXaFWpArz1MDSvDMixaAGpLTowTR0UWwODRLIThEdORFnlH
7y7RAAqvveMXCJhpwZciNoq3KTLjCFsUO4B4fELgWRRFfIFi1pWwgDshzkes
nIHKlpfQSaEnjyrVryzDLBKDVp+Iuem4fRDRQGphSAovfcjxvgs3HvXSg0r1
zFH1hVGROKhKmAihvF10kgxfNmKvLp3LcKVwCEgp1lLAHKIQdNdTvnOT0wX+
TDOghQ3LgrMykH.dI+L3Pfr+NmeaCUpk3SyL8+j3h9zIq1ij+d7YQWsDFvox
u9TwjWYJ3k8vxSbbfuYyDFdtgZbywSG2T8CC2kY61kX5KYeVKzLGqKu30cNI
GzwZVISMYDaBfyo8tvjfv8N0ek9P3xSgPT8acYccL.WY2gX8lNxxS2+WwljQ
7GDBSbyaRFLk+mY90kfgd8fFZFdTadPmG2yT2kK2tY7Aj8YIZcuT5ZSoGp2P
fZJVMQnVSkma+WKTeYXnx0d2WEYkXxXM7Fw5KIYQyKZ2ZRhdCIwCxhEgtv2h
aXaDv5O8MOw7hIOI66ChRo+n2FBkcYTI3OyPQ6hJGdJXge61kgS2FRHelnKG
5di8LFGgeo21w02.Fo+6XshS7bWFgNAy8QmuEZM6goTfnafL5ANUSuTQiUpw
glWfUgD45A4z1yWJ9VRd31Cds+XRYbQR7VSemfNjVtSIncwiY7pCRWNSazME
23oaOP6x270Tzqh.pIHQ.LmsEsUGCHlv0XBoYSGmSJiM63iw8nszTU6Chcrb
lQprFJluc34Ms9SFMXbQKciM2WMx69Tob.zV94pjhri4ap6p5OcFzIHt0TTF
m5BWzpRRec5TqOZAgePBh8fjibDxwNOAhbmBxk5xfhxE98tAUSuLfrnyfrFC
yysqL2sjBeXRRMFIQ0yfjDOLIwGCi3r5jkamXpdCB+PELoaAStSufQ4wgmGO
N5njU3rHKxnjkdVjEdTx5bM893sGxf7zqlMKjZSY194wnCkZb04bRkD2G6X+
uhQCWIwPAZt6UBl8rsuxqt6Pz5QE1bNBwzPEFPVjYYpmPeuLfQ7iwKbjBG+Q
D6wsKSCK724P8N5qcQjV5FmsJTqXgX+IAwUKqX1W9l61folhW2cJK4nIh2ur
XihKnlMYMrNbNVD4Xh8OGQiGSb+K4D9rKhNb3USdQUkch.xR42xbdY5ktGiS
8O5x7IH27Zbc8ceK.AQ4P5VkPtVGycCqfeH8emTA6y.+0THyLe7B2QaF3RaK
Exqp3PjGGtr6d5ud5eHRJQAK
-----------end_max5_patcher-----------

This works well in that you can monitor the progress with smaller @maxiter sizes, then manually stop the training when you think you’ve gone far enough.

The issue here is that it’s quite easy to overfit the data, as the “early stopping” criterion is effectively ignored by this approach: even if the loss stops improving over a number of epochs and the actual fit-ing bails early, the next fit message keeps the process going anyway.

The version of the patch on the right would work better in that you just run a single huge @maxiter and call it a day, or manually run it a few times if need be. The problem with this is that it’s hard to tell whether you’ve actually fit things well, especially since the loss range is arbitrary(ish): you get a single number back with no context as to whether that’s good or not.

So what I’m proposing here is to add a message output to the regressors where they send out a stop message, or maybe fit stop (the former would likely be better as it wouldn’t break any existing patches), when the “early stopping” criterion is met. That way, in a loop like the one above, I can use that to stop the overall training and avoid overfitting.

It would also mean that I could avoid needing a toggle for the loop at all: it could just train until it’s happy, then bail and let you know how it went.
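
To make that concrete, here’s a rough sketch (in Python-flavoured pseudocode) of the loop I have in mind. The fit_chunk() call and the early_stopped flag are purely hypothetical stand-ins for the fit message and the proposed stop notification, not anything that currently exists in FluCoMa:

```python
# Hypothetical sketch of the proposed workflow -- not real FluCoMa API.
# fit_chunk() stands in for sending "fit <maxiter>" to the regressor, and the
# early_stopped flag stands in for the proposed stop / "fit stop" message.

def train_until_early_stop(regressor, chunk_iters=1000, max_chunks=1000):
    """Keep issuing smallish fits so progress can be monitored, but halt the
    outer loop as soon as the object reports that early stopping kicked in."""
    loss = None
    for chunk in range(max_chunks):
        loss, early_stopped = regressor.fit_chunk(chunk_iters)  # hypothetical
        print(f"chunk {chunk}: loss {loss:.4f}")
        if early_stopped:
            print("early stopping reported, halting the outer loop")
            break
    return loss
```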

////////////////////////////////////////////////////

A fancier version of the FR would be to have any fit message report the loss as it’s being computed, rather than a single time at the end. Or perhaps it could always spit out a fixed number of loss values per @maxiter, such that if you set @maxiter 100000 it will spit out, say, 10 loss values along the way, so you can see whether things are improving over that time.

////////////////////////////////////////////////////

Lastly, if there’s a way to do this currently that I’m missing, do let me know.

So: the early-stopping criterion is checked against the error on held-out validation data rather than the training data (if @validation > 0), which means there isn’t actually a simple way to implement early stopping outside the object: one would have to manually make a validation set and test against it.
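
For illustration, a minimal sketch of that manual route in generic Python (train_some() and validation_loss() are hypothetical stand-ins, and the patience value is arbitrary):

```python
# Rough sketch of manual early stopping against a held-out validation set.
# train_some() and validation_loss() are hypothetical stand-ins, not FluCoMa API.

def fit_with_early_stopping(model, train_data, val_data,
                            max_rounds=1000, patience=10):
    best_val = float("inf")
    rounds_since_improvement = 0
    for _ in range(max_rounds):
        train_some(model, train_data)           # e.g. a small number of epochs
        val = validation_loss(model, val_data)  # error on the held-out set
        if val < best_val:
            best_val = val
            rounds_since_improvement = 0
        else:
            rounds_since_improvement += 1
        if rounds_since_improvement >= patience:
            break  # validation error stopped improving: stop to avoid overfitting
    return best_val
```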

That’s not ideal; we should consider adding an option to report the validation loss as well, which is lowish-hanging fruit. More complicated would be to have something fit-like that actually persists the whole relevant state between calls – tracking the validation loss, but also using the same partitions of the data.

The fancy version would be very hard to do as it stands: the fitting loop would need to be fired off into a different thread in order to report the loss(es) asynchronously.

In this case I’m using the default @validation 0.2, so there is some validation in place; it’s just being banged over and over externally, obviously.

I guess a validation output message would solve the use case here (I think?), where I can just check for that and call it a day once that stops improving (or whatever).

I guess in general the UX is tricky to navigate (as a non-data person). Ideally this would either be a single-message “make it good” kind of thing, with stuff like @maxiter acting more like @zlmaxsize or @maxfftsize (an upper limit on what you might ever need rather than what you intend to do), or a clearer iterative process where you can monitor the progress and decide when it’s good enough (if you know what you’re looking for).

Don’t know if there’s any overlap with the partition stuff you were talking about in this thread, but I’m a big fan of overhead being added in exchange for ease of use/better results.

‘partition’ here is how the data gets split into training and validation sets (or, more generally, into training and test sets). As it stands, every time fit gets called a new partition is made, selecting a random portion for the training data and validation set. So, in that sense calling fit iteratively isn’t equivalent to calling it once with a big maxIter. How much that matters, I’m not sure tbh – clearly it works well enough for getting the training to converge, but I wonder if it would make comparing validation losses more noisy.
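
To sketch that behaviour (just the shape of it, with a hypothetical random_partition() helper, not how the internals are actually written):

```python
import random

def random_partition(ids, validation=0.2):
    """Hypothetical helper: shuffle the ids and split off a validation portion."""
    ids = list(ids)
    random.shuffle(ids)
    n_val = int(len(ids) * validation)
    return ids[n_val:], ids[:n_val]  # training ids, validation ids

# Current behaviour, roughly: every fit call draws a *new* random partition, so
# the validation set that early stopping checks against changes between calls.
def fit_once(model, ids, maxiter):
    train_ids, val_ids = random_partition(ids)
    ...  # train for up to maxiter, checking early stopping against val_ids

# A "persistent" variant would draw the partition once, outside the fit call,
# and reuse it, making validation losses from successive calls comparable.
```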

(The answer is suck it and see, ofc. I’ll put an issue up to prod at this when I can)

I like this.

//=============================

I know I keep doing this thing where I read a conversation between you two and just drop a patch here and peace out… maybe it is useful, maybe it is annoying?

Here’s a patch doing classification that creates a training/testing split (randomly, roughly 80/20) and then validates on the testing data. This way we get the training loss, the validation loss, and the testing set persists across fittings. (However, scrambling the validation set, as happens when using @validation > 0 and repeatedly calling fit, is in itself a kind of cross-validation that is helpful in preventing overfitting.)

This patch could certainly be cleaned up, and maybe formed into a nice abstraction?
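
For anyone who finds it easier to read this outside Max, here’s a rough scikit-learn analogue of what the patch is doing (placeholder data and network settings, not the patch’s actual values):

```python
# Rough sklearn analogue of the patch: make one fixed ~80/20 split, then keep
# fitting while reporting training loss and loss on the persistent test set.
from sklearn.datasets import load_iris
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)  # placeholder data

# The split happens once, so the test set persists across fittings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=200,
                    warm_start=True, random_state=0)

for round_ in range(5):
    clf.fit(X_train, y_train)  # warm_start=True keeps training the same network
    test_loss = log_loss(y_test, clf.predict_proba(X_test))
    print(f"round {round_}: training loss {clf.loss_:.4f}, test loss {test_loss:.4f}")
```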

Most emphatically useful :heart:

Love it :pray: I know I’ve said it before, but a compiled partitioning object feels like an increasingly serious lack. Doing the partition at the point of dataset creation as you do here makes it less faffy than trying to partition extant data / label sets in extant code (which is variably dreadful in different environments: not even sure there would be a way in PD at the moment), but seems like it should be a one-click job in a better world.

Riiiight. Ok, that wasn’t clear to me as I just assumed that some of that would persist over.

So the intended UX here is to just use a big chonky maxiter once and call it a day? (unless one is purposefully trying to cross-validate or do something special).

Yes please!

For stuff in those other threads about computing class means, or removing outliers, etc., it’s an absolute nightmare to manually code the separating/dumping/organizing of the sub-datasets, much more so than the actual computation being done (means/outliers).

Interesting!

I’ve done this kind of stuff manually, which is a big pain in the ass. I do wonder how generalizable this would be with an arbitrary number of classes etc… but being able to get a % correct out of the validation data would be super handy.

(I do wish the reported error was easier to tie to some kind of normalized range so it intrinsically communicated more)

So, as I mentioned in this other thread, I’ve been working on a project with Jordie Shier where we’ve been training up a load of NNs to predict longer time series from shorter ones. Basically a much more fleshed-out version of some of the stuff I’ve been doing in SP-Tools, where you can take a 256-sample window and predict the descriptors for a 4410-sample window.

Long story short: in order for this to work really well, Jordie ended up computing these in Python rather than in Max, and when plotting how well the different approaches did, it was clear that overfitting was a big issue with the Max approach.

Here’s a spreadsheet showing what’s up:

So the rows are different audio files and example subsets, and the columns are different engines training up NNs. B and C are fluid.mlpregressor~ and D is Jordie’s Python-based implementation of the same code (it matches, but not 100% numerically). The biggest difference is in column G, where it’s a night-and-day difference.

So all of that is to say that being able to parse the validation loss while iterating through a training set would be game-changing for the viability of training up a chonky NN in Max (without making a single large run and pinwheeling Max).

I guess having an option to keep the partitioning across iterative fit calls would be great too; otherwise it just ends up doing some cross-validation, which isn’t ideal but also not terrible, particularly as compared to the danger of overfitting.

I don’t know enough about how the guts of the algorithm are running, but presumably it can just dump a message prepended with validationloss or maybe just loss, in addition to the fit message it sends out.

edit: Also re-read this useful thread where some of this is unpacked, but in the context of SC where I guess getting the validation loss in an ongoing manner is possible.

Are those numbers averaged mean square error against a test set or something?

In any case, yes, this would neatly show why early stopping is a Good Thing.

What Python implementation is that? scikit-learn? And how much worse is performance in practice in Max when validation is turned on?

As for eventual implementation, most likely fit and fitTransform / Predict will output a 2nd item for the validation loss if @validation > 0 → i.e. fit <training loss> <validation loss>

I think so, though I can check with Jordie.

We ran this test early on, and then just switched over to the Python-based training, outputting a .json in the fluid.mlpregressor~ format so we could still do inference in RT in Max.

Again, I can check. The main thing is that when I did the training directly in Max, where I couldn’t check the validation loss, I was just guessing as to the overfitting (and likely always overfitting), since I only ran it “until the line went down”, and that line was just the training loss, which I “want to be low”.

Yeah that’s a good call, though I can see situations where that could break existing patches, since a patch may be coded such that there’s a [route fit] that then expects a single float after it.

This value is currently computed right? It’s just not reported per fit?

It’s computed internally to the neural network training code but doesn’t currently get passed back out, so there’s a wee bit of rejigging needed to change that

Hmm, yeah. OTOH, if people are doing stuff with the training loss number, they maybe shouldn’t be. We’ll certainly think about it once it becomes an actual prospect, but I’d lean in the direction of introducing a small breaking change for this

Even just a silly thing like my patch above would end up sending a list to multislider which would then no longer display in the “rolling” mode (if I didn’t explicitly cast it as a float in the t b f I guess).

Beyond visualization stuff I’ve never done anything with that number, but I have been burned by “changing the number of values output” from a FluCoMa object before, where it broke logic in a way that was tricky to narrow down.

It definitely is more elegant/efficient though, to have both come out at once.

Yeah, it’s because its only real use is display and qualitative interpretation that I’m mildly relaxed about a breaking change.

It won’t break the scrolling view on multislider afaik, because that’s an attribute. If you send a list to it (in any display mode) you just get an extra trace – which might be a mild PITA if you’re making a nice UI for training and have twiddle-able @validation, because you’d need to reset multislider@size.

I basically agree, but right now it’s kind of the only way of knowing “have I trained it enough?”, and our overfitting protection with @validation is quite hidden and therefore hard to reason about, as Rod is pointing out here (and it’s in particular hard for students to grasp: since it doesn’t report anything out, we can’t know if early stopping happened, so it sort of feels like validation is not that important, which I would disagree with).

I agree.

I hope it can be useful to toss in a pedagogical perspective, cuz I’m in the middle of that right now. I think for a learner, having the two numbers pop out is a bit too auto-magical (I think I’m still for it though). For someone who doesn’t understand validation, the two will probably feel like the same-ish number, and students will be confused as to which is which and what the difference is.

I think what would have more friction (and therefore be more pedagogical) is to have an object called something like [fluid.datasetpartition], [fluid.datasetkfolds], or [fluid.datasetsplit] (I’m riffing on the name [fluid.datasetquery]) that does a training/testing split or k-folds partition on training data. That way we’d have to make and see the different objects named [fluid.dataset~ training-partition-data] and [fluid.dataset~ testing-partition-data] (and [fluid.labelset~ training-partition-labels] and [fluid.labelset~ testing-partition-labels], etc.).

I think conceptually (and therefore pedagogically) it would be much more clear. Of course the downside is that it’s more patching, and more foot guns I suppose, but I’m ok with this.
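
As a rough conceptual analogue outside Max (sketched with scikit-learn; the partition objects above are hypothetical suggestions, not existing FluCoMa objects), this is essentially what such a split would hand you:

```python
# Conceptual analogue of a hypothetical [fluid.datasetsplit]: one operation that
# hands back training/testing partitions for both the data and the labels.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.rand(100, 13)         # placeholder dataset: 100 points, 13 dims
labels = np.random.randint(0, 3, 100)  # placeholder labels: 3 classes

(training_partition_data, testing_partition_data,
 training_partition_labels, testing_partition_labels) = train_test_split(
    data, labels, test_size=0.2, stratify=labels, random_state=0)

print(training_partition_data.shape, testing_partition_data.shape)
```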

FWIW this is what I’ve been doing with students (using an evolution of the patch I showed above) and it really seems to make validation clear. The FR from me here is an easy way to do the partitioning.

thanks @tedmoore for these insights. In the meantime, can I suggest road-testing the pedagogy of this object by doing a fake datasetsplit? I can think of 2 ways:

  • making a dataset with the same labels and a random class-value integer; datasetquery has a method to assemble data from 2 datasets, so you would query class = 1 and class = 0 and get 2 outputs.
  • dumping into a dict, then shuffling and splitting (roughly sketched below).
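
A minimal sketch of the second route, assuming the dump dict has the usual shape with a "cols" entry and a "data" field mapping identifiers to value lists (worth checking against a real dump):

```python
# Sketch of the "dump, shuffle, split" route. Assumes a dump dict shaped like
# {"cols": N, "data": {"identifier": [v1, v2, ...], ...}}; verify against a real dump.
import random

def shuffle_split(dump_dict, test_ratio=0.2, seed=0):
    ids = list(dump_dict["data"].keys())
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_ratio)
    test_ids, train_ids = ids[:n_test], ids[n_test:]

    def subset(keep):
        return {"cols": dump_dict["cols"],
                "data": {i: dump_dict["data"][i] for i in keep}}

    return subset(train_ids), subset(test_ids)  # two dicts ready to load back in
```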

Is that what you did?

I’m all for having the pedagogical angle with things, but it would be brutal if one had to manage this as part of the actual training/learning process (I don’t think (or hope!) you’re suggesting this).

Part of why I wasn’t initially so keen on having it output two floats is that, without labels, we’re kind of full circle back into FluCoMa toolbox 1, where arguments were set as an array of unlabelled values (e.g. fit 0 1 1 0 1 1), which felt absolutely impenetrable. So having fit 0.3 0.8 on an output is a slight return to that, as you need to know what the values are, and what they mean, for it to actually be useful. With fit 0.3 on its own it’s a simple “number goes down, this is good”-type thing.

@tedmoore, all this is as well as, not instead of, the idea of a partitioning object that we’ve already discussed a couple of times.

A partitioning object throws up some interesting interface questions, so any detailed ideas you have for how this thing should behave in Max/pd/SC, hmu.

But that’s the exact problem – it’s not necessarily a number-goes-down-this-is-good thing, because of over-fitting, right? Reporting the validation loss will be an improvement on what we have, but – belt and braces – you still want a way to evaluate against a properly held-out test set that hasn’t been involved in training at all. (And, even better, to have a smooth way to run this against multiple possible partitions (‘folds’), which guards against flukes but also has applications in helping select features and tune network parameters.)
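
For reference, the ‘folds’ idea is what toolkits like scikit-learn expose as k-fold cross-validation; a minimal sketch with placeholder data and model, just to show the shape of it:

```python
# Minimal k-fold sketch: every point gets a turn in the held-out fold, so the
# evaluation is less at the mercy of one lucky (or unlucky) random split.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

X = np.random.rand(200, 13)  # placeholder inputs
y = np.random.rand(200)      # placeholder targets

fold_errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("per-fold MSE:", fold_errors, "mean:", float(np.mean(fold_errors)))
```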

This is one of those areas where coming up with a useful interface for artists is going to be juicy, and may well have to diverge from how straight up ML toolkits like sklearn offer this functionality.

Totally, I’m in agreement there. My mention of “number goes down” was meant in a critical sense, as it omits the rest. So folding a second number in as a list means it’s easier to lose the significance of the second number and its importance. Even though it’s clunkier, my original suggestion of having a second, explicitly different, name sent out would mean that they are decoupled. That way you can conceptually (and visually, in the patching) separate fit (i.e. trainingloss) and validationloss, and can then plot them like @doett does in the other thread.

You can obviously still unpack a list after route fit, but you would need to know that you have to do this in the first place, and why.

Having spent a week steeped in some of this stuff from a more data-science-y perspective, and having seen the manifest difference between what a properly trained NN can do and what I was doing with my iteration loop, I know that it’s not a simple problem when it comes to UX, and that short of some kind of hyper-parameter tuning thing, the results will be limited to whatever architecture you pop in when making the object in the first place.

Being able to plot both sets of losses (in the first place), and perhaps having this baked into the patches/examples with some context, would go a long way towards getting some better-trained NNs within Max without having to just set a huge fit message and pinwheel until it (presumably) reports back that it’s bailed at some point.