Making sense of fluid.mlpregressor~’s autoencoder

Had a great geekout with @tremblap this afternoon and he walked me through his workflow when using fluid.mlpregressor~ as an autoencoder.

I was initially comforted by the fact that the faff, and the results he was getting, were in line with what I had experienced up to this point (when used with “real world” data, vs toy examples).

Off the back of that, I’ve spent the better part of the afternoon/evening running long training things, tweaking settings, clearing, rerunning, and plotting the output to see (and hear) how it’s performing.

I have to say that after hours of this today, I’m no closer to having a useful reduction, particularly as compared to UMAP or even PCA on the same material.

By an absolute longshot, this is the best fit I managed to get:

Clocking in with a reconstruction error of ~45.

The clustering here is, as you can probably imagine, dogshit.

What’s in this dataset is 76D of MFCCs/stats (20 MFCCs (no 0th) with mean/std/min/max, i.e. 19 × 4 = 76) run on a mix of different percussion samples, and is exactly the Timbre process from this patch.
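
For reference, this is roughly what that analysis boils down to outside of Max. It's a hedged Python approximation rather than the actual Timbre patch; librosa and the filename are stand-ins:

```
# Not the FluCoMa Timbre patch itself, just a rough Python equivalent of
# how the 76 dimensions fall out: 20 MFCCs, drop the 0th, then four
# summary stats each -> 19 * 4 = 76.
import numpy as np
import librosa

y, sr = librosa.load("some_percussion_hit.wav", sr=None)   # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)[1:]      # drop the 0th coefficient
feats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                        mfcc.min(axis=1), mfcc.max(axis=1)])
print(feats.shape)  # (76,)
```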

Here’s the same data with PCA:

Raw UMAP:

Standardize → UMAP:

Robustscale → UMAP:

So a couple of things strike me. One, the interface is quite different from other similar objects (predict vs transform, for example), and some of the feedback is a bit weird. For example, the reconstruction error is, apparently, a sum of all the errors, so it scales up with the size of the dataset. This is really confusing, as seeing an error of “45” seems unbelievably bad (which I guess it is), but the magnitude of this error is relative to the data it is attached to. Not sure why this number wouldn’t be normalized to 0. to 1. so relative comparisons, or a sense of scale, would make sense.

Also, it would be really nice if the @activation info (via attrui) included the range in the name (e.g. activation (3: Tanh (-1 1)), or even just activation (-1 1)), or was even only the range information, as that seems more significant/useful than the function being applied.

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

Lastly, and I guess the main point of this thread: is the general workflow with autoencoders super fiddly, and potentially not viable for certain types of data? Up to this point I had chalked up my failure to get useful results to a lack of understanding (which I still have in spades), but after speaking with PA and seeing and replicating his workflow, I’m nowhere nearer getting something that converges in an even remotely useful way.

Is this indicative of me needing to learn the voodoo/dance better? Are MFCCs not good features for autoencoding? Are certain feature spaces, or specific instantiations of feature spaces, not suitable for autoencoding?

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

This is the data I ran the above on:
mfccs_76d_DPA.json.zip (537.6 KB)

And here’s my “good” fit:
mlp 45_2.json.zip (39.8 KB)

Hi @rodrigo.constanzo,

I’d need @groma to provide a more fulsome explanation of how best to approach working out what an autoencoder could / couldn’t do for you given a particular data set, but I can at least make some general comments.

tl;dr, no you can’t conclude from this experiment that MFCCs and autoencoders don’t mix, nor that they’re not viable for certain types of data. The main problem here is that there isn’t enough training data (772 points) for the size of the model (almost 4k parameters).

Your network has, from what I can see*, three hidden layers of sizes 24 3 24, and input / output layers of 76D. The layers in the MLP are fully connected, which means that it has to learn a weighting for a connection between every input and each neuron on the next layer, and then between each neuron on this layer and each on the following, and so on; it also has to learn biases (offsets) for each neuron in the hidden layers. So, the number of parameters (weights) the network needs to learn in this case is

(76 * 24) + (24 * 3) + (24 * 3) + (76 * 24) + (2 * 24) + (2 * 3) => 3846

The first rule of neural network club is that you need more data than you have parameters to learn in the model. I know you’d like an exact answer to how much more, but generally one doesn’t know exactly. However, rule of thumb: 10x more training points than parameters to get started. Ergo, for a network architecture this size, you’d want ~40k training points as a starting point. (Just to make life fun, it’s also possible to over-fit, i.e. memorise the training data rather than generalise, but that’s certainly not what’s happening here.)
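
If it helps to sanity-check the arithmetic, here’s a rough sketch of the standard fully-connected count (one weight per connection, one bias per non-input neuron, so the bias bookkeeping differs slightly from the figure above, but the ballpark is the same):

```
# Rough parameter count for a fully connected 76 -> 24 -> 3 -> 24 -> 76 MLP.
layers = [76, 24, 3, 24, 76]
weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))  # one weight per connection
biases = sum(layers[1:])                                       # one bias per non-input neuron
params = weights + biases
print(weights, biases, params)   # 3792, 127, 3919 -- "almost 4k"
print(10 * params)               # rule-of-thumb training points: ~40k
```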

Even if we strip things back to a single 3-neuron hidden layer (reducing the parameter count to ~500), you’d still want a good chunk more training points. Short of actually generating a whole heap more (which may be the simplest approach, if it’s not too labour intensive), some (potentially parallel) approaches spring to mind:

  1. Do you really (really?) need 76 input dimensions? Are they all doing useful work relative to each other? I think I at least described how PCA can be used in starting to make an assessment on this front in another thread, even if I didn’t get around to demonstrating it.
  2. IIRC, although I don’t have a reference to hand, with autoencoders being kinda-unsupervised, one can ‘make up’ new training data based on transforming what’s there, without needing to also produce labels. (Cough. This is where I’m hoping my kindly colleague will step in). Possibilities might (:grimacing:) include adding noise to original training points (see the sketch just after this list); or going back to the pre-statistically-summarised points and summarising at different boundaries to get different (but ‘true’) reports of means, derivatives etc.
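
For what it’s worth, here’s a minimal sketch of the noise idea (Python rather than Max; X is assumed to be the training data as a NumPy array, and the 1% noise scale and copy count are arbitrary illustrative choices):

```
import numpy as np

rng = np.random.default_rng(0)

def augment_with_noise(X, copies=10, scale=0.01):
    """Return X stacked with `copies` jittered versions of itself."""
    noise_std = scale * X.std(axis=0, keepdims=True)   # per-dimension noise level
    jittered = [X + rng.normal(0.0, noise_std, size=X.shape) for _ in range(copies)]
    return np.vstack([X] + jittered)

# e.g. 772 original points -> 772 * 11 = 8492 points
# X_big = augment_with_noise(X, copies=10)
```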

It’s also possible that a round of PCA beforehand could be useful in picking off some dimensions. Although MFCCs are, in principle, well decorrelated thanks to the discrete cosine transform having some PCA-ish properties, it still may be that after all the post-processing there’s fat that PCA can cut, making the neural network leaner. Come what may, though, unless you whittle things down to the point where the autoencoder is only learning ~72 parameters, you’ll want more data from somewhere.
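
As a hedged sketch of how you might eyeball that (sklearn standing in for fluid.pca~, and X assumed to be the 772 × 76 array):

```
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)                               # fit on the full 76 dims
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.95)) + 1  # components needed for 95% of the variance
print(n_keep)
```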

So: we’re not near the voodoo yet. Just feed it!

On activations, another rule of thumb: just use reLU unless there seems a pressing reason not to. Don’t get hung up about ranges either, because the thing is learning (potentially bipolar) weights and biases. With the exception of very tiny networks, or networks with feedback connections (not an issue here), reLU is likely to be much less faffy every time.

/////////////////
*[From what I can see] There’s a slightly embarrassing problem that MLPs loaded from file aren’t updating their object’s attributes yet, so I have to dive into the debugger to confirm the details of the MLP settings.


I keep forgetting this rule of thumb. This is maybe why when we tried 96 [2] 96 it converged better and faster…


I’ve also had less success using autoencoders (although it was maybe 2 years ago now that I was really into it). And I was getting similar-looking results to yours(!), with points around the edges all smooshed up against each other.

Upon reflection, and being reminded of this, it makes sense to me that this was probably my problem (as well?).

//=======================================
@weefuzzy, regarding neural nets more generally, is there somewhere I can read more about this:

I feel like I’ve had most success with sigmoid (aka. “logistic”, I mention just for posterity) in the hidden layers and then identity in the output layer, or sometimes sigmoid in the output layer. But if I should re-investigate reLU, some materials to dig into could be interesting.

Also, regarding above, what is the threshold for a “very tiny network”?

Thank you!

T


Thanks for this. There’s so much knowledge here that overall falls into the “it depends” quadrant, but at the same time is “things you should know, and not really deviate from”.

Well, I’ve been experimenting with using as few dimensions as possible, but this is stemming from “the other side”, where I take a load more features/stats and then run them through some kind of funnel to pull things down from there. So 76D feels like a modest number of features to start off with.

I hadn’t thought of this as it seemed either redundant or “bad practice”, particularly with how brutal PCA is. I’d be more inclined to do some of the SVM stuff from the other thread to prune down to the most salient features before running them through some kind of destructive/transformative process like PCA/UMAP/MLP/etc… Don’t know if that’s just my lack of knowledge, but it seems that having multiple layers of transformation this way is akin to transcoding audio, which can introduce compound artefacts along the way.

I guess in general with this, if I have a “small” number of samples in my corpus (<40k) but a “large” number of descriptors (>70), I would have to add some additional steps to overcome the initial fitting of the network?

Lastly, is the faff and twiddling required to train an autoencoder something that (generally speaking) leads to something that better reproduces/captures/clusters/whatever the information in a dataset? And I suppose in non-linear ways that wouldn’t otherwise be possible with PCA/UMAP/TSNE. Or rather, is the autoencoder a whole load of work just to have something that’s “different” from PCA/UMAP/TSNE?

I guess there’s specific flavors or reasons for each activation type to be there, but is this then another parameter for “test and see what you get”?

Not sure what you mean by brutal in this context?

Yes and no. Yes, insofar as what’s at work is an accumulation of different modelling assumptions, but no insofar as the metaphor doesn’t stretch all that far: reduction isn’t necessarily the name of the game here (although I get that for your specific purpose here, of trying to reduce whilst preserving some sense of perceptual intelligibility, it resonates).

To reiterate: <40k or whatever is only small or large relative to the task at hand. 40k would be a lot to put through UMAP, for example (though do-able). In this specific instance, if you want a multi-layer neural network to learn relationships across >70 dimensions, then the corpus is small for this task.

Autoencoders are fundamentally different, so neither intrinsically better nor worse: they’re trying to model based on a different objective than, say, MDS, UMAP or tSNE on the one hand, or PCA on the other (which does something different again). UMAP et al. are ‘about’ trying to model the data in a lower dimensionality so that small distances in the original dimensionality remain small in the new one. An autoencoder realised through an MLP is lower-level, insofar as there are fewer modelling assumptions at work. It’s just trying to minimise how badly it reproduces the correct output for a given input, so it doesn’t even really have a concept of distance or space going on. Lower level means more fiddling, but with the payoff that a trained and validated network can be really zippy, and amenable to incremental tuning and re-fitting to new data in a way that’s much harder (often not possible) with the specialised dimension reduction things.

Interestingly, I was reading something by Laurens van der Maaten, who was one of the people who developed tSNE, noting that by and large autoencoders aren’t so useful for visualisation (which isn’t the whole of dimension reduction), so perhaps for your goals here, it wouldn’t be the natural choice. He also noted that multilayer autoencoders are in general hard to train, because they can get stuck in local minima very easily.

About the faff, prompted by this conversation I’ve started building something to make interactive tuning a bit easier to get started with. Because the loss numbers themselves aren’t so important as their trajectory, the early bits of fiddling are much easier if you can look at the trajectory of the loss over time as you experiment with different learning rates etc:
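
For anyone outside Max, a rough Python equivalent of the same idea (watch the loss trajectory while varying the learning rate). This is a sketch, not the patch above, and it assumes X_norm is the normalised training data as a NumPy array:

```
from sklearn.neural_network import MLPRegressor

for lr in (0.1, 0.01, 0.001):
    net = MLPRegressor(hidden_layer_sizes=(24, 3, 24), activation="relu",
                       learning_rate_init=lr, max_iter=2000)
    net.fit(X_norm, X_norm)              # autoencoder: target == input
    print(lr, net.loss_curve_[::100])    # coarse view of the loss trajectory
```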


A whole bunch of great info, again.

I’m still not sure, practically speaking, when it’s better to go for more descriptors/stats and then reduce/transform later, or to take fewer while still transforming(/clustering), etc… A lot of permutations are possible there, with the differences being difficult to discern.

One semi-concrete follow-up question:

In the sense that, if you’re trying to fit things, doesn’t the loss tell you something about how useful what you’ve fit is in the first place? Like, it could improve a ton, but still be dogshit overall (and perhaps not worth the work/effort of massaging the numbers to improve).

More concretely, is the reconstruction error a sum of all the errors for each node in the system? Like, is it proportional to the overall network size (e.g. 7 3 7) or to the number of parameters (e.g. 3846, like in your first response)? And what does an error of 45 mean in the context of the initial example (where the number of datapoints is nowhere near enough to properly fit the network)?


Well, bearing in mind one of your other current threads, a bit of both: if you can pare down the input dimensions to get rid of redundancy (through, e.g., correlation or low variance) to start with, and then find a minimal chain of processes that gets you somewhere useful, that’s the dream.

The loss for the MLP doesn’t depend on the network architecture, but it does depend on the range of your target values (i.e. whatever it is you’re fitting to). The loss value is the sum of the squared differences between each prediction and the training point, divided by the number of samples in the batch. So, the range of the loss will scale with the dimensionality of the output vector, but is also nonlinear: i.e. it will look much scarier for bigger errors.
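
In pseudo-ish Python, and hedged as a paraphrase of that description rather than the object’s actual internals:

```
import numpy as np

def reconstruction_loss(pred, target):
    """pred and target are (n_points, n_dims) arrays."""
    return float(np.sum((pred - target) ** 2) / len(pred))
```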

So, sure, you can totally work out some ballpark ideas of what’s reasonable: if you have 76 dimensions in the range 0-1, then clearly 76 would be a bad, bad number. If they’re all in the range 0-2, then the maximum loss becomes 2² * 76 (304). If the ranges are heterogeneous, you can still use fluid.normalize either to normalize the data, or just to peek at the ranges in the training set (but then it’s harder to reason about the overall wrongness).

Playing around with your MFCCs earlier whilst tinkering with the patch above, I found I was getting more rapid convergence if I normalised, so I know that the outputs were in 0-1. In the screenshot it’s reporting a loss of 1.7-odd, which would average out at an error of ~0.15 (√(1.7/76)) in each dimension, which isn’t (yet) wonderful but not totally horrible either. To get that down to a 10% average error would be a loss of 0.76 (in this case).
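
Checking that arithmetic with the same definition (76 normalised, 0-1 output dimensions assumed):

```
import numpy as np

dims, loss = 76, 1.7
print(np.sqrt(loss / dims))   # ~0.15 average error per dimension
print((0.1 ** 2) * dims)      # 0.76 -> the loss that a 10% average error implies
```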


Bumping this with some short-term thoughts while I process things (also in the middle of marking season), but had a cool chat with someone (Gergő Kováts) after the talk I gave at Notam a couple of weeks ago.

He mentioned a couple things of interest that I wanted to bump this thread with.

The first is the idea of data augmentation. I don’t know how applicable it is for music, but things like changing the phase, amplitude, transposition, etc… of the available data to create a larger footprint of similar/usable data to feed into a network, to hopefully mitigate some of the problems of needing a dataset proportional to the given/needed/useful network size.

The other exciting thing was the prospect of doing some heavy lifting in another environment/context (specifically Google Colab, though I guess also Python, etc…) and then using the pre-fit network inside the fluid.verse~. On that note, presuming the same network structure (e.g. 20->10->4->10->20, or whatever), is it possible to take an external fit and load it into fluid.mlpregressor~? I remember there being a whole thing with retaining a similar structure as scikit or whatever, which I would presume would let things like this be possible.

He said he was happy to take some datasets from me to create something in Colab to crunch them numbers, so hopefully it would be possible to use later in Max. My plan here is to create a new/comprehensive dataset of sounds I can make on my snare (up in the thousands, where my testing one has been around 600 I think), and then try building two datasets from that, one from 256 samples’ worth of analysis and the other from 4410.

The last idea, though less directly applicable, was that of disentanglement, which I guess is related to the idea of salience but different.

Either way, don’t have a load of headspace to process this yet, but wanted to write some bits down that seemed interesting and to also see how viable importing pre-fit network stuff was.

Love the Colab idea. I was thinking this same thing today since my uni has a giant cluster that I can use, but it only runs Python.

mmmm…swift

Yes, in principle you should be able to use weights from an MLP model trained elsewhere, so long as you can jam them into the appropriate JSON format for our object (which isn’t super documented yet, but certainly not at all impossible). Bear in mind that you’ll need to limit yourself to the activation functions that we support.

(The correspondence with the scikit-learn interface isn’t so much the issue here, as the correspondence between the internal data structures that the weights and biases are kept in).
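
As a hedged sketch of the first half of that round trip, pulling weights and biases out of a scikit-learn model: coefs_ / intercepts_ are the real sklearn fields, but the JSON layout below is illustrative only, so the sensible move would be to dump a file from fluid.mlpregressor~ first and match its structure.

```
import json
from sklearn.neural_network import MLPRegressor

net = MLPRegressor(hidden_layer_sizes=(24, 3, 24), activation="relu")
net.fit(X_norm, X_norm)   # autoencoder-style fit; X_norm assumed to exist

dump = {
    "weights": [w.tolist() for w in net.coefs_],       # one matrix per layer
    "biases": [b.tolist() for b in net.intercepts_],   # one vector per layer
}
with open("external_fit.json", "w") as f:
    json.dump(dump, f)
```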

Data augmentation is certainly worth a try, but not everything will pay dividends: changing the time-domain phase, for example, won’t do a great deal if you’re then just using Mel bands or MFCCs as your principal feature. Basically, the key is to think of augmentations that make some sort of sense for the data the model is likely to encounter at run time. For your percussive hits, perhaps subtle time stretches would be useful, and maybe even small amounts of saturation.
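
Something like this, as a sketch of augmentations that ‘make sense’ for percussive hits (librosa and the stretch/drive amounts are stand-ins for whatever analysis chain is actually in use):

```
import numpy as np
import librosa

def augment_hit(y, sr):
    out = []
    for rate in (0.98, 1.0, 1.02):                          # subtle time stretches
        stretched = librosa.effects.time_stretch(y, rate=rate)
        for drive in (1.0, 1.5):                            # small amounts of saturation
            out.append(np.tanh(drive * stretched) / np.tanh(drive))
    return out                                               # 6 variants per hit
```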


I guess if it’s in data(/Python)-land, it’s easy enough to transform/munge the numbers into the appropriate order/format.

That’s a good call about the functions. I was mainly thinking the structure, but this is just as important (or more important I guess).

Beyond that, is there any other secret sauce stuff that wouldn’t be compatible? In terms of how the functions are computed or scaled etc… (not in terms of the data feeding into it (ala standardize), but in terms of the maths).

I see. Yeah, that makes sense. Actually, that could go a really long way, I would think: just having a whole mess of +/- 0.001 duration transformations and/or saturation amounts. Or perhaps a “smart”-er approach where you have a dataset that contains x amount of points and want a network of n size, so pressing a button that “chubs that shit up” to be the minimum useful size for the network.
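
(A literal version of that button, using the 10x rule of thumb from upthread and the numbers from the first example, just to get a feel for the scale:)

```
import math

n_points, n_params = 772, 3846
copies_needed = math.ceil((10 * n_params) / n_points) - 1   # extra augmented copies per original point
print(copies_needed)                                        # ~49 variants per original hit
```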

Don’t think so: many of the parameters of the MLPs are to do with the training: once you have a trained model, everything should be completely standard.


The question I have about Colab is how long it will be around, and how long it will keep using Swift with Chris Lattner no longer on the project.

That notebook environment they have for Swift is pretty sweet.

Google never prematurely kills off projects that are enjoyed and used by many people.(!)

And after banging my head against the wall to get this working, this may be the case here:

No more Swift in Colab. But they do have Python.


I don’t know what Colab necessarily offers over just running it locally, unless you really want to do GPU training, which is problematic in and of itself. In my experience it’s not much faster on the free tier than just running something on my machine, and then you don’t have vendor lock-in to boot. I’ve already made an adapter for Python > Dataset in python-flucoma · PyPI if anyone wants to try it out. That would get you halfway to transforming stuff pretty easily for the Max side of things. It might be outdated though, given how things have changed and my lack of involvement coding anything lately.

EDIT:

I realise I sound kinda narky: I had some bad times with early Google Colab. What I didn’t think about was how it simplifies the process of working with Python for people who don’t want to set up a local dev environment (which is a PITA sometimes). I wonder how easy it is to get audio up into Colab?

Also, just to clarify: Colab runs Python and is not a language itself, so you are tied to writing Python if you do want to use it.

And just to continually add fuel to the fire I would be super interested in training something like this to be used in Max:

https://towardsdatascience.com/one-shot-learning-with-siamese-networks-using-keras-17f34e75bb3d

They looked pretty snazzy and contemporary to me when I first looked which is a while ago now, but perhaps super relevant to those who want to do classification with not much data.


One shot learning in 2005 on HSN:
