Dimensionality reduction, disparate spaces, and speed

So in light of new bits coming out soon, I’ve been thinking about how I might apply dimensionality reduction.

I’m sure a bunch more will become clear once we have the new tools, and @tremblap mentioned doing a video walkthrough of the new stuff, which is great.

So if you have a large multi-dimension descriptor space, you can apply some dimensionality reduction to it and end up with a descriptor space that is better(?!) suited for searching, or at least more efficient, or different, etc…

That makes sense.

So if I have a source and I want to find nearest matches, both the source and the target would (presumably?) be run through the same descriptor and dimensionality reduction algorithms, so that (after normalization/standardization/sanitation/whatever) one could browse one with the other.

I also gather that the way the descriptor space is processed will (also) have a large impact on the matching and such. I’m still with it here.
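To make that "same pipeline for both" idea concrete, here's a rough sketch using sklearn and random stand-in numbers (the actual descriptors and tools would obviously differ; PCA is just one possible reducer):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
corpus = rng.normal(size=(3000, 8))  # 3000 corpus items x 8 descriptors (stand-in data)

# fit the standardizer and the reducer on the corpus only
scaler = StandardScaler().fit(corpus)
reducer = PCA(n_components=3).fit(scaler.transform(corpus))

corpus_3d = reducer.transform(scaler.transform(corpus))  # corpus in the reduced space

# a real-time frame must pass through the SAME scaler and reducer
frame = rng.normal(size=(1, 8))
query_3d = reducer.transform(scaler.transform(frame))
```

The key point is that the fitting happens once, offline, and incoming points only get transformed, so both sides end up in the same reduced space.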

So where I’m a bit lost is how this would apply when working with disparate spaces, where you want minimal latency. Specifically the stuff I’ve been working on where I have multiple stages of analysis being applied to incoming audio, with varying amounts of descriptors and statistics being analyzed for each stage. How does this relate to a much larger descriptor space where analysis time is no issue?

More concretely, an example.

Snare input, super low latency, with an initial analysis window of 64 samples, and perhaps a 512 sample latency overall (with another analysis stage in there that is staggered).

And then using it to navigate a 3k+ sample library of metal sounds.

64 samples is not a lot of time, and certain descriptors don’t make sense at that time scale (e.g. pitch). So for that initial burst I mainly just do loudness, centroid, and flatness. The longer my analysis window, the more descriptors/stats I start incorporating. Meaning I have a variable amount of dimensions that I’m starting off with.

So is the general idea that I would apply dimensionality reduction to the large sample library and end up with x amount of dimensions (let’s call it 3 for now) which describe the overall descriptor space?

I then have my input/real-time analysis, with far fewer dimensions available. Do I also reduce that to 3 dimensions? Is whatever is significant about the dimensionally reduced real-time stuff mappable, in some way, onto another dimensionally reduced space?

Like if my drums tend to be muffled hits, perhaps timbre isn’t massively important, whereas the opposite may be the case for the sample libraries.

Basically I’m having conceptual trouble figuring out how dimensionality reduction and the mapping/querying of spaces work when the source/target material are very different.

Obviously all the normalization/standardization would, potentially, mitigate the differences in scale for everything, but perhaps not the significance of what the algorithm(s) have chosen to reduce down to.

And then finally, the issue of speed/latency. So at the moment, in the other thread, I’m working out a multi-stage analysis approach so the initial tiny fragment is matched, crudely, with something from the database, then moving on and on. In a dimensionally reduced ML paradigm, this wouldn’t work(?).

Is the idea that once you go into dimensionality reduction and ML querying, that it’s an all-or-nothing approach?


So yeah, lots of conjecture and spitballing here, but already priming my brain for upcoming bits.

Main takeaway question(s), I guess, can be boiled down to:

How does dimensionality reduction work when you have (sonically) disparate spaces with varying amounts of underlying descriptors and dimensions?

I guess at that point what you’ve got on your hands is a mapping exercise. It’s possible, of course, that one could get great results just by tying dimension n of the input to dimension n of the (different) lookup space, but in practice you’d probably want something else in between to help you fine-tune the translation (e.g. a KNN regression). That’s generally the case if you’re trying to reconcile these two different spaces: how they relate to each other is going to be a musical decision, above all.

The point about using something like a regression or a classifier as a mapping device is that they are supervised, unlike the dimensionality reduction stuff we’ve encountered so far, which is unsupervised. As such, it’s a matter for you to give it examples of how X maps to Y, see if you like the general results, then perhaps add more examples, or go back with different features, and generally rinse / repeat.
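A minimal sketch of that supervised "X maps to Y" workflow, using sklearn’s KNN regressor and made-up example pairs (the real X/Y would come from your own chosen correspondences):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
# X: points in the (reduced) real-time input space
# Y: where you decided each of those points should land in the lookup space
X = rng.normal(size=(200, 3))  # stand-in example inputs
Y = rng.normal(size=(200, 3))  # stand-in target coordinates

mapper = KNeighborsRegressor(n_neighbors=5).fit(X, Y)

# a new input gets interpolated between the targets of its nearest examples
target = mapper.predict(rng.normal(size=(1, 3)))
```

The rinse/repeat loop described above is just refitting with more (or different) example pairs until the general behaviour feels right.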

As for the practical matter of whether you’ll be able to work within your 64 sample threshold using this stuff, we’re back to It Depends (but only a bit). We can say with some certainty that the more stages there are, the more computation time you need; and that many dimension reduction techniques are non-trivial. You could always have a mapping between a dimension reduced lookup space and a non-reduced real-time input.

Yes. Once you’re into comparing these disparate things, standardizing in particular starts to be quite useful. But this applies as much to the supervised stuff as to dimension reduction.
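For what it’s worth, the standardizing point can be seen in two lines: give each space its own statistics so wildly different ranges end up in comparable units (random stand-in numbers again):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
drums = rng.normal(loc=-30, scale=6, size=(500, 3))    # one descriptor space
metal = rng.normal(loc=-12, scale=20, size=(3000, 3))  # a very differently scaled one

# standardize each space with its OWN mean/std so both sit around 0 with unit spread
drums_z = StandardScaler().fit_transform(drums)
metal_z = StandardScaler().fit_transform(metal)
```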


Hmm, that’s a good way to put it. As a mapping thing.

What if what I want is to tell it “things that sound like each other should be mapped to each other”. Like, is there an unsupervised “supervised” algorithm where you just use (perceptually meaningful) descriptors to train the regression algorithm?

I suppose it’s easy enough to give it examples of what kind of sounds might be used as inputs (by playing a range of sounds/techniques (though this would, conceptually, feel a bit limiting if I’m defining the field before playing ball (not a useful metaphor I realize, since that’s exactly how you play ball))), but it would be harder to do with arbitrary samples.

Again, I suppose part of it would be to create some kind of dimensionally reduced space/map, then browse the clusters and be like, I want to trigger those kinds of sounds with these kinds of sounds, but those are the kinds of decisions and processes that I’d like to, as much as possible, avoid. I haven’t fully unpacked why, but this kind of (pre)compositional decision making and thinking I find quite uninspiring/uninteresting. Not opposed to it, just the opposite of excited about the prospect.

Like, I’d sooner accept an algorithmically more arbitrary, but conceptually simpler, “mapping the expressive range of each to the other”.

That’s interesting. It strikes (struck?) me that something like that may have to be the case when working with tiny windows and/or staggered windows, where the “legit” ML route might not be quick enough. (I’m bumping the hybrid stitching thread with my findings so far on this approach).

If the data are described by the same features then you can fit a dimension reducer to the lookup dataset and then re-use the fitting on new datapoints as they arrive. But it sounds like you’re not making life that simple?

If you’ve got stuff that’s described in quite possibly different ways, then there’ll have to be some mechanism for denoting what ‘similar’ sounding means, and this will (probably minimally) involve using a common set of features as the basis of a distance measure during training. “Sounds like” in a general sense is hard!

What I’m thinking is that you’d want to train a mapping between dimension reduced space A and dimension reduced space B based on minimizing a distance derived from (e.g.) the MFCCs ‘behind’ the points in A or B, if you see what I mean. This might be garbage, and I have no real notion about how / whether one could start to attack this with what’s on our menu so far. Hopefully @groma will come and rescue me…
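One naive way to sketch that idea (very much a toy, with random stand-in MFCCs): pair each point in A with the B point whose underlying MFCCs are closest, then learn the A-to-B mapping from those pairs.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
mfcc_a = rng.normal(size=(200, 13))  # MFCCs 'behind' the points in space A
mfcc_b = rng.normal(size=(300, 13))  # MFCCs 'behind' the points in space B
a_3d = rng.normal(size=(200, 3))     # reduced space A coordinates
b_3d = rng.normal(size=(300, 3))     # reduced space B coordinates

# for each A point, find the B point with the closest MFCCs...
nearest_b = pairwise_distances(mfcc_a, mfcc_b).argmin(axis=1)

# ...then fit a supervised mapping A -> B from those automatic pairings
mapper = KNeighborsRegressor(n_neighbors=5).fit(a_3d, b_3d[nearest_b])
mapped = mapper.predict(a_3d[:1])    # where the first A point lands in B
```

Whether the automatic pairing is musically sensible is exactly the open question in the thread; this just shows the mechanics of bootstrapping a “supervised” mapping from an unsupervised distance.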

Indeed I am not, hehe. The main problem is being able to analyze static files offline in a detailed/slow way, while not having the luxury to do the same for real-time sounds, where I’d want to wait no longer than 512 samples at most. So the analysis time, as well as the meaningfulness of what can be done in the short term, won’t ever be the same as in the long term.

I guess I can just limit my offline analysis to what I can do in real time and map them accordingly, basically going with a “lo-fi” approach overall.

Heh. Yeah. I guess I meant just auto-mapping things that have similar/same loudness/centroid(/pitch(/mfccs)) to each other, rather than having to manually go through and tag corpus sounds to perceptually corresponding clusters.

I just remembered some discussion towards the end of the 2nd plenary when @groma was showing his dimensionality reduction iPad thing. Basically I inquired if there was some kind of hybrid between dimensionality reduction and self organizing maps where you can reduce the dimensions, then kind of seed the SOM with another set of parameters (i.e. loudness and brightness being the 2 dimensions) (MDS?).

Perhaps something like this could do the trick where there can be an arbitrary amount of dimensions for offline analysis, but then that is remapped/organized via a lower dimensional space.

Again, spitballing, and largely uninformed speculation(!).