Making sense of fluid.mlpregressor~'s autoencoder

I keep forgetting this rule of thumb. This is maybe why when we tried 96 [2] 96 it converged better and faster…


I’ve also had less success using autoencoders (although it was maybe 2 years ago now that I was really into it). And I was getting similar-looking results to yours(!), with points around the edges all smooshed up against each other.

Upon reflection and being reminded of this

it makes sense to me that this was probably my problem (as well?).

//=======================================
@weefuzzy, regarding neural nets more generally, is there somewhere I can read more about this:

I feel like I’ve had the most success with sigmoid (a.k.a. “logistic”, I mention just for posterity) in the hidden layers and then identity in the output layer, or sometimes sigmoid in the output layer. But if I should re-investigate ReLU, some materials to dig into could be interesting.

Also, regarding above, what is the threshold for a “very tiny network”?

Thank you!

T


Thanks for this. There’s so much knowledge here that overall falls into the “it depends” quadrant, but at the same time consists of “things you should know, and not really deviate from”.

Well, I’ve been experimenting with using as few dimensions as possible, but this stems from “the other side”, where I take a load more features/stats and then run them through some kind of funnel to pull things down from there. So 76D feels like a modest number of features to start off with.

I hadn’t thought of this as it seemed either redundant or “bad practice”, particularly with how brutal PCA is. I’d be more inclined to do some of the SVM stuff from the other thread to prune down to the most salient features before running them through some kind of destructive/transformative process like PCA/UMAP/MLP/etc… Don’t know if that’s just my lack of knowledge, but it seems that having multiple layers of transformation this way is akin to transcoding audio, which can introduce compound artefacts along the way.

I guess in general with this, if I have a “small” number of samples in my corpus (<40k) but a “large” number of descriptors (>70), I would have to add some additional steps to overcome the initial fitting of the network?

Lastly, is the faff and twiddling required to train an autoencoder something that (generally speaking) leads to something that better reproduces/captures/clusters/whatever the information in a dataset? And I suppose in non-linear ways that wouldn’t otherwise be possible with PCA/UMAP/TSNE. Or rather, is the autoencoder a whole load of work just to have something that’s “different” from PCA/UMAP/TSNE?

I guess there’s specific flavors or reasons for each activation type to be there, but is this then another parameter for “test and see what you get”?

Not sure what you mean by brutal in this context?

Yes and no. Yes, insofar as what’s at work is an accumulation of different modelling assumptions, but no insofar as the metaphor doesn’t stretch all that far: reduction isn’t necessarily the name of the game here (although I get that for your specific purpose here, of trying to reduce whilst preserving some sense of perceptual intelligibility, it resonates).

To reiterate: <40k or whatever is only small or large relative to the task at hand. 40k would be a lot to put through UMAP, for example (though do-able). In this specific instance, if you want a multi-layer neural network to learn relationships across >70 dimensions, then the corpus is small for this task.

Autoencoders are fundamentally different, so neither intrinsically better nor worse: they’re trying to model based on a different objective than, say, MDS, UMAP or tSNE on the one hand, or PCA on the other (which does something different again). UMAP et al. are ‘about’ trying to model the data in a lower dimensionality so that small distances in the original dimensionality remain small in the new one. An autoencoder realised through an MLP is lower-level, insofar as there are fewer modelling assumptions at work. It’s just trying to minimise how badly it reproduces the correct output for a given input, so it doesn’t even really have a concept of distance or space going on. Lower level means more fiddling, but with the payoff that a trained and validated network can be really zippy, and amenable to incremental tuning and re-fitting to new data in a way that’s much harder (often not possible) with the specialised dimension reduction things.
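To make that concrete, here’s a minimal sketch of the idea, using Python/scikit-learn purely as a stand-in for fluid.mlpregressor~ (the layer sizes, learning rate and data here are just illustrative, not a recommendation):

```python
# An autoencoder is just an MLP whose target is its own input, so the only
# thing being minimised is reconstruction error -- no notion of distance or
# neighbourhoods as in UMAP/tSNE.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(600, 76)                      # stand-in: 600 points x 76 descriptors
X = MinMaxScaler().fit_transform(X)              # normalise so the loss is easy to reason about

ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8),  # bottleneck of 2: 76 -> 8 -> 2 -> 8 -> 76
                  activation='logistic',         # sigmoid hidden layers
                  learning_rate_init=0.01,
                  max_iter=2000)
ae.fit(X, X)                                     # input == target: that's the whole trick
print(ae.loss_)                                  # scikit-learn's squared-error training loss
```

(scikit-learn happens to fix the output activation to identity, which lines up with the sigmoid-hidden/identity-output recipe mentioned above.)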

Interestingly, I was reading something by Laurens van der Maaten, who was one of the people who developed tSNE, noting that by and large autoencoders aren’t so useful for visualisation (which isn’t the whole of dimension reduction), so perhaps for your goals here, it wouldn’t be the natural choice. He also noted that multilayer autoencoders are in general hard to train, because they can get stuck in local minima very easily.

About the faff, prompted by this conversation I’ve started building something to make interactive tuning a bit easier to get started with. Because the loss numbers themselves aren’t so important as their trajectory, the early bits of fiddling are much easier if you can look at the trajectory of the loss over time as you experiment with different learning rates etc:
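For anyone who wants to poke at the idea outside Max in the meantime, here’s a rough Python sketch of why the trajectory matters more than any single number (scikit-learn as a stand-in again; `loss_curve_` is its record of the loss at each iteration):

```python
# Compare loss trajectories for a few learning rates on the same data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor

X = np.random.rand(600, 76)   # stand-in data

for lr in (0.001, 0.01, 0.1):
    ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation='logistic',
                      learning_rate_init=lr, max_iter=500)
    ae.fit(X, X)
    plt.plot(ae.loss_curve_, label=f'learning rate {lr}')

plt.xlabel('iteration')
plt.ylabel('loss')
plt.legend()
plt.show()
```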


A whole bunch of great info, again.

I’m still not sure, practically speaking, when it’s better to go for more descriptors/stats and then reduce/transform later, or to take fewer while still transforming(/clustering), etc… There are a lot of possible permutations there, with the differences being difficult to discern.

One semi-concrete follow-up question:

In the sense that, if you’re trying to fit things, doesn’t the loss tell you something about how useful what you’ve fit is in the first place? Like, it could improve a ton but still be dogshit overall (and perhaps not worth the work/effort of massaging the numbers to improve).

More concretely, is the reconstruction error a sum of all the errors for each node in the system? Like, is it proportional to the overall network size (e.g. 7 3 7) or the possible combinations (e.g. 3846, like in your first response)? Like, what does an error of 45 mean in the context of the initial example (where the number of datapoints is nowhere near enough to properly fit the network)?


Well, bearing in mind one of your other current threads, a bit of both: if you can pare down the input dimensions to get rid of redundancy (through, e.g., correlation or low variance) to start with, and then find a minimal chain of processes that gets you somewhere useful, that’s the dream.

The loss for the MLP doesn’t depend on the network architecture, but it does depend on the range of your target values (i.e. whatever it is you’re fitting to). The loss value is the sum of the squared differences between each prediction and its training point, divided by the number of samples in the batch. So the range of the loss will scale with the dimensionality of the output vector, but it’s also nonlinear: i.e. it will look much scarier for bigger errors.

So, sure, you can totally work out some ballpark ideas of what’s reasonable: if you have 76 dimensions in the range 0-1, then clearly 76 would be a bad, bad number. If they’re all in the range 0-2, then the maximum loss becomes 2² * 76 (304). If the ranges are heterogeneous, you can still use fluid.normalize either to normalize the data, or just to peek at the ranges in the training set (but then it’s harder to reason about the overall wrongness).

Playing around with your MFCCs earlier whilst tinkering with the patch above, I found I was getting more rapid convergence if I normalised, so I knew that the outputs were in 0-1. In the screenshot it’s reporting a loss of 1.7-odd, which would average out at an error of ~0.15 (√(1.7/76)) in each dimension, which isn’t (yet) wonderful but not totally horrible either. Getting that down to a 10% average error would mean a loss of 0.76 (in this case).
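Spelled out in code, the arithmetic is just this (the 76 and 1.7 being the numbers from this example):

```python
import math

dims = 76     # output dimensionality, everything normalised to 0-1
loss = 1.7    # reported loss

per_dim_error = math.sqrt(loss / dims)   # ~0.15, i.e. roughly 15% average error per dimension
target_loss = (0.10 ** 2) * dims         # 0.76: the loss you'd see at ~10% average error
print(per_dim_error, target_loss)
```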


Bumping this with some short-term thoughts while I process things (I’m also in the middle of marking season), but I had a cool chat with someone (Gergő Kováts) after the talk I gave at Notam a couple of weeks ago.

He mentioned a couple things of interest that I wanted to bump this thread with.

The first is the idea of data augmentation. I don’t know how applicable it is for music, but things like changing the phase, amplitude, transposition, etc… of the available data to create a larger footprint of similar/usable data to feed into a network, to hopefully mitigate some of the problems of needing a dataset proportional to the given/needed/useful network size.

The other exciting thing was the prospect of doing some heavy lifting in another environment/context (specifically Google Colab, though I guess also Python, etc…) and then using the pre-fit network inside the fluid.verse~. On that note, presuming the same network structure (e.g. 20->10->4->10->20, or whatever), is it possible to take an external fit and load it into fluid.mlpregressor~? I remember there being a whole thing about retaining a similar structure to scikit or whatever, which I presume would make things like this possible.

He said he was happy to take some datasets from me to create something in Colab to crunch them numbers, so hopefully it would be possible to use later in Max. My plan here is to create a new/comprehensive dataset of sounds I can make on my snare (up in the thousands, where my testing one has been around 600 I think), and then try building two datasets from that, one from 256 samples’ worth of analysis and the other from 4410.

The last idea, though less directly applicable, was that of disentanglement, which I guess is related to the idea of salience but different.

Either way, don’t have a load of headspace to process this yet, but wanted to write some bits down that seemed interesting and to also see how viable importing pre-fit network stuff was.

Love the Colab idea. I was thinking this same thing today since my uni has a giant cluster that I can use, but it only runs Python.

mmmm…swift

Yes, in principle you should be able to use weights from an MLP model trained elsewhere, so long as you can jam them into the appropriate JSON format for our object (which isn’t super documented yet, but certainly not at all impossible). Bear in mind that you’ll need to limit yourself to the activation functions that we support.

(The correspondence with the scikit-learn interface isn’t so much the issue here, as the correspondence between the internal data structures that the weights and biases are kept in).
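If it helps anyone get started, the general shape of such a conversion might look roughly like this in Python/scikit-learn. To be clear, the JSON keys below are placeholders I’ve made up for illustration; dump a trained fluid.mlpregressor~ and match whatever structure it actually writes:

```python
# Hypothetical sketch: pull weights/biases out of a scikit-learn MLP and write
# them out as JSON. The key names here are made up -- check them against a
# dump from fluid.mlpregressor~ itself before trying to load anything.
import json
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(500, 20)
model = MLPRegressor(hidden_layer_sizes=(10, 4, 10),  # 20 -> 10 -> 4 -> 10 -> 20
                     activation='logistic', max_iter=1000)
model.fit(X, X)

layers = []
for weights, biases in zip(model.coefs_, model.intercepts_):
    layers.append({
        'weights': weights.tolist(),   # shape (n_inputs, n_outputs) for this layer
        'biases': biases.tolist(),
        'activation': 'sigmoid',       # stick to activations FluCoMa supports
    })

with open('mlp_weights.json', 'w') as f:
    json.dump({'layers': layers}, f)
```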

Data augmentation is certainly worth a try, but not everything will pay dividends: changing the time-domain phase, for example, won’t do a great deal if you’re then just using Mel bands or MFCCs as your principal feature. Basically, the key is to think of augmentations that make some sort of sense for the data the model is likely to encounter at run time. For your percussive hits, perhaps subtle time stretches would be useful, and maybe even small amounts of saturation.
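A numpy-only sketch of what that could look like for short hits (the stretch rates and drive amounts are illustrative guesses; keep them within the range of variation you’d actually expect at run time):

```python
# Simple augmentations for short percussive hits: tiny resampling-based time
# stretches (note: this also nudges pitch slightly) plus mild saturation.
import numpy as np

def tiny_stretch(y, rate):
    """Crude stretch by resampling; fine for very small rates on short hits."""
    n = int(round(len(y) / rate))
    return np.interp(np.linspace(0, len(y) - 1, n), np.arange(len(y)), y)

def saturate(y, drive):
    """Mild tanh saturation, rescaled back to the original peak level."""
    out = np.tanh(drive * y)
    return out * (np.abs(y).max() / (np.abs(out).max() + 1e-12))

hit = np.random.randn(4410) * np.hanning(4410)   # stand-in for ~100 ms of snare at 44.1 kHz

augmented = [saturate(tiny_stretch(hit, r), d)
             for r in (0.99, 1.0, 1.01)
             for d in (1.0, 1.5, 2.0)]
# ...then analyse each augmented hit exactly as you would the originals.
```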


I guess if it’s in data(/Python)-land, it’s easy enough to munge the numbers into the appropriate order/format.

That’s a good call about the functions. I was mainly thinking the structure, but this is just as important (or more important I guess).

Beyond that, is there any other secret sauce stuff that wouldn’t be compatible? In terms of how the functions are computed or scaled etc… (not in terms of the data feeding into it (ala standardize), but in terms of the maths).

I see. Yeah, that makes sense. Actually that could go a really long way I would think, just having a whole mess of +/- 0.001 duration transformations and/or saturation amounts. Or perhaps a “smart”-er approach where you have a dataset that contains x amount of points and want a network of size n, so you press a button that “chubs that shit up” to the minimum useful size for the network.

Don’t think so: many of the parameters of the MLPs are to do with the training: once you have a trained model, everything should be completely standard.


The question I have about Colab is how long it will be around, and how long it will keep using Swift with Chris Lattner no longer on the project.

That notebook environment they have for Swift is pretty sweet.

Google never prematurely kills off projects that are enjoyed and used by many people.(!)

And after banging my head against the wall to get this working, this may be the case here:

No more Swift in Colab. But they do have Python.


I don’t know what Colab necessarily offers over just running it locally, unless you really want to do GPU training, which is problematic in and of itself. In my experience it’s not much faster on the free tier than just running something on my machine, and then you don’t have vendor lock-in to boot. I’ve already made an adapter for Python > Dataset in python-flucoma · PyPI if anyone wants to try it out. That would get you halfway to transforming stuff pretty easily for the Max side of things. It might be outdated though, given how things have changed and my lack of involvement coding anything lately.

EDIT:

I realise I sound kinda narky - I had some bad times with early Google Colab. What I didn’t think about was how it simplifies the process of working with Python for people who don’t want to set up a local dev environment (which is a PITA sometimes). I wonder how easy it is to get audio up into a Colab?

Also, just to clarify, Colab runs Python and is not a language itself, so you are tied to writing Python if you do want to use it.

And just to continually add fuel to the fire I would be super interested in training something like this to be used in Max:

https://towardsdatascience.com/one-shot-learning-with-siamese-networks-using-keras-17f34e75bb3d

They looked pretty snazzy and contemporary to me when I first looked, which is a while ago now, but they’re perhaps super relevant to those who want to do classification with not much data.


One shot learning in 2005 on HSN:


That is a cute dog :dog:


I was thinking about this today as I was planning on recording a much more comprehensive set of “sounds I can make with my snare”. Knowing that the corpus would be so specific to the snare, the head, the tuning, and (to a certain extent) the room, and that it wouldn’t necessarily translate if I went to a gig and used another snare, or even just had the head drift in tuning over time, is a bit of a bummer.

So that led me down a couple paths of thinking.

  • creating the minimum viable corpus for any given snare (maximum variety/dynamics, with a generous helping of data augmentation to fill in the gaps)
  • creating a monolithic corpus for each setup I have and just streamlining that process
  • thinking about the viability of having a mega-chunky corpus that is continuously fed new snares/setups/tunings and keeps getting bigger every time I use the system with a new drum
  • seeing if it’s somehow possible to train a NN on some kind of archetypical aspects of the sounds (within the world of “short attacks on a snare”), which is then made a bit more specific with samples of the exact snare in any given setup

Part of that last example was remembering the topology of the machine learning snare thing that I was looking into a while back:
[image: topology diagram from the patent application]

It could just be that this makes sense for the purposes of the patent application but from the looks of it, the NN is trained on data that is distinct from the user generated and trained aspects. In fact, remember when I last used the software, you would go into a training mode, and give it around 50 hits of any given zone (“snare center”, “snare edge”, etc…), and then come out of training mode and it worked immediately. There was never any computation that went along with it (unless it happened as you went and was super super super fast). You literally toggled in and out of training mode ala a classifier. But there’s an NN involved somewhere/somehow. How?