MFCC to lower dimensions via UMAP (best practice)

In this week's geekout we got to talking about the thing I built in this thread, and @tremblap mentioned that he wasn't seeing such good clustering with the 76D MFCC space (mean/std/min/max) → 4D reduction I was applying. I didn't see a massive correlation either, but I chalked it up to it being a 76D → 4D → 3D reduction that was making it shitty.

After some chat (with helpful input from @jamesbradbury and @tedmoore, as well as @tutschku), I decided to revisit that part of the patch, this time figuring out which UMAP settings to use by visualizing 2D/3D reductions of the 76D space, with the idea of then applying those same UMAP settings to the 4D reduction.

There was also some discussion about whether to standardize before UMAP or feed the MFCC output directly into UMAP.

Here are my tests/results so far (with a video below).

So using the default UMAP settings I was using before (pulled from the helpfile) I get this:

The standardized one (on the right) has more of that stringy/stretched thing I’ve been seeing, which I thought was a “double UMAP” artefact.

Even with this I get much better 'clustering' without the pre-standardization, but after massaging the settings a bit more I get this:

Obviously there are some weird outliers there, but I suspect that's due to the dataset being completely normalized for visualization. If I understand fluid.robustscale~ correctly, the main clump on the left will occupy the main fluid.kdtree~ space, with these outliers still existing, but being waaaaaay out there.
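To sanity-check that intuition, here's a minimal numpy sketch of robust scaling (centre on the median, scale by the interquartile range, which is my understanding of what fluid.robustscale~ does) versus plain min-max normalization, on a clump plus a few extreme outliers. The data is synthetic, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# A main clump plus a few extreme outliers, like the plot above.
x = np.concatenate([rng.normal(0, 1, 500), [40.0, 55.0, -60.0]])

# Robust scaling: centre on the median, scale by the interquartile
# range, so the clump (not the outliers) defines the working range.
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

# Min-max normalization for comparison: the outliers define the range,
# so the clump gets squashed into a tiny sliver.
minmax = (x - x.min()) / (x.max() - x.min())

clump_spread_robust = robust[:500].std()   # clump keeps a usable spread
clump_spread_minmax = minmax[:500].std()   # clump crushed near one value
```

So with robust scaling the clump occupies a sensible range and the outliers sit way out, rather than the outliers dictating the scale of everything.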

I also ran the same thing over and over again to see how much it converges and it’s pretty good. Wiggles around a bit, but nothing super dramatic. I’m not entirely sure how much movement is “too much” in that sense.
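One way to put a number on "too much movement" is to align two refits of the same dataset and measure what's left over. This is a sketch using orthogonal Procrustes alignment (`embedding_drift` is a hypothetical helper, not part of FluCoMa), which discounts the rotations/reflections UMAP is free to introduce between runs:

```python
import numpy as np

def embedding_drift(A, B):
    """Sum of squared differences between two embeddings of the same
    points, after optimally translating, scaling and rotating B onto A
    (orthogonal Procrustes). 0 means the refits agree exactly."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    A = A / np.linalg.norm(A)
    B = B / np.linalg.norm(B)
    U, s, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt                      # best orthogonal map from B to A
    return float(((A - s.sum() * (B @ R)) ** 2).sum())
```

There's no hard threshold for what counts as "too much", but tracking this number across refits at least turns "wiggles around a bit" into something comparable between settings.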

Here’s a bunch of examples of the same dataset being refit:

Here’s a quick video:


So in addition to just sharing the results of this (which have been great), I wanted to talk about visualizing reductions (particularly when your target dimensionality is >3), and how generalizable UMAP settings can be.

With regards to the first one, I hadn’t really considered tweaking settings in 3D and then extrapolating that back up to my target dimensionality (4D in this case). I was just relying on 76D → 4D → 3D, and expecting that process to go badly along the way.

As far as the latter, as you can hear from the video, these sounds are roughly in the same ballpark of sounds: metallic-ish percussive sounds, all analyzed with tiny windows. So perhaps the UMAP settings above work well for this particular corpus, but may be problematic for other datasets. This will be less cumbersome if some kind of dynamic "analysis pipeline" is implemented, where each individual corpus can be tailored and massaged around, and a corresponding realtime/matching analysis pipeline can be reconstructed around those settings, but I'd like to figure out some settings that "work" (well enough) across the board.

Oh, and this is what the new UMAP settings look like propagated back into the main patch.

The 3D version:

And 2D:

Keep in mind that Timbre is still going through two UMAP conversions for this (76D → 4D (what goes into the composite dataset) → 3D for viz), so I imagine that must introduce some artefacts. Still looks better than what I had before:

My guess is that if you don't standardize, then the initial nearest-neighbour distances that UMAP starts from will be dominated by whichever features have the greatest range (and these ranges can vary a great deal between MFCCs). So, if you're getting better results (relative to whatever it is you want to happen) without standardising, what this suggests to me is that there are features in your data that are hindering the discovery of the sorts of structure you're after, and that this becomes more pronounced once the features are all squished into comparable ranges.
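The range-domination point is easy to demonstrate numerically. In this toy numpy sketch (synthetic data, just two features standing in for MFCC statistics), one feature spans roughly ±100 and the other roughly ±1, and we measure how much of the squared pairwise distance each feature accounts for:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy features with wildly different ranges, as you can get from
# raw MFCC statistics: feature 0 spans ~±100, feature 1 spans ~±1.
X = np.column_stack([100 * rng.normal(size=300),
                     rng.normal(size=300)])

def share_of_feature0(X):
    """Average share of squared pairwise distance due to feature 0."""
    diff = X[:, None, :] - X[None, :, :]     # (n, n, 2) pairwise diffs
    sq = (diff ** 2).sum(axis=(0, 1))        # per-feature totals
    return sq[0] / sq.sum()

raw = share_of_feature0(X)       # ~0.9999: feature 0 decides everything
Xs = (X - X.mean(0)) / X.std(0)  # standardize (z-score) each feature
std = share_of_feature0(Xs)      # 0.5: both features get an equal say
```

Unstandardized, the wide-range feature effectively decides the nearest-neighbour graph on its own; standardizing gives every feature an equal vote, for better or worse depending on whether those features carry the structure you care about.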

From the above it doesn't seem like you're changing the UMAP parameters all that drastically. Going from the discussion at Basic UMAP Parameters — umap 0.5 documentation, the stringiness can be a consequence of having numNeighbours set relatively low, so that the focus is very much on local structure, whereas you might want more of a balance between local and global. Conversely, lower minDist values seem to promote more clumping, so perhaps pushing the latter down and raising the former will help.
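For anyone experimenting outside Max, the same two parameters exist in the Python umap-learn package as `n_neighbors` and `min_dist` (corresponding to FluCoMa's numNeighbours and minDist). A config-style sketch of the suggestion above, with values that are illustrative rather than recommended:

```python
import umap

# Higher n_neighbors -> more weight on global structure (less stringy);
# lower min_dist -> points allowed to pack more tightly (more clumping).
reducer = umap.UMAP(n_neighbors=30, min_dist=0.05, n_components=3)
# embedding = reducer.fit_transform(X)  # X: your (n_points, 76) MFCC stats
```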


Thanks for this thorough follow up to the discussion indeed!

Yes this is what that does.

How long is a piece of string? This 'error' is variation on top of the 'error' induced by dimensionality reduction. If near stuff (for your musicking) is near stuff (on the graph) then you're golden. I will now need those settings to try on my dataset, which triggered the discussion! Exciting!

the left 2D graph has managed to get a pitch axis (top to bottom), which is fun. @groma and @weefuzzy told me about the MFCC liftering IRCAM is doing, so that might also be worth exploring (although I don't know exactly what they do, and I know @weefuzzy did some work comparing various flavours of MFCCs last year in discussion with @tutschku)
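For reference, one common flavour of liftering (the HTK-style sinusoidal lifter; I don't know whether IRCAM's version is the same, so treat this as a representative example) just re-weights the cepstral coefficients before you use them as features:

```python
import numpy as np

def lifter(mfccs, L=22):
    """Apply an HTK-style sinusoidal lifter to MFCCs.

    mfccs: (n_frames, n_coeffs) array. Coefficient n is weighted by
    w_n = 1 + (L/2) * sin(pi * n / L), boosting the mid-quefrency
    coefficients relative to the low ones.
    """
    n = np.arange(mfccs.shape[1])
    w = 1 + (L / 2) * np.sin(np.pi * n / L)
    return mfccs * w
```

Since it changes the relative ranges of the coefficients, it would interact with the whole standardize-or-not question above.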

One idea I proposed earlier is to use an MLP as an autoencoder to extract a 4D latent space, i.e. 4 dimensions that best explain the 76D without being human-readable. Maybe we should do a live parallel coding session, you and I, where I code that bit in your patch and we see where it leads us. You could ask me what I do and why as I do it, for instance… not that I am the master of such algorithms, but then I presume I need to learn more, and @weefuzzy can learn from our joint mistakes?
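To make the autoencoder idea concrete, here's a deliberately minimal numpy sketch (synthetic stand-in data; a linear encoder/decoder rather than a real MLP, so this is the skeleton of the idea, not what a fluid.mlpregressor~ version would look like): compress 76D down to 4D, train to reconstruct the input, and keep the 4D codes as the latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a 76D MFCC-stats dataset: 500 points that secretly
# live on a 4D latent subspace, plus a little noise.
latent = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 76))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 76))
X = (X - X.mean(0)) / X.std(0)           # standardize before training

# Minimal linear autoencoder: encode 76D -> 4D, decode 4D -> 76D,
# trained by gradient descent on mean-squared reconstruction error.
W_enc = rng.normal(scale=0.1, size=(76, 4))
W_dec = rng.normal(scale=0.1, size=(4, 76))
lr = 0.01
losses = []
for _ in range(200):
    Z = X @ W_enc                        # 4D latent codes
    X_hat = Z @ W_dec                    # reconstruction
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    W_dec -= lr * (Z.T @ err) / len(X)   # gradient w.r.t. decoder
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)  # ...and encoder

codes = X @ W_enc                        # the 4D "latent space"
```

A real MLP version would add nonlinear hidden layers so the 4 dimensions can capture more than linear mixtures, but the shape of the idea is the same: the 4D bottleneck is forced to keep whatever best reconstructs the 76D input.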


I meant more in the sense that you seemed to be looking for "an amount of movement that was ok" (as in, it "converged"), and I don't know what that looks like. The lumps sort of stay in the same place, so I guess that's what it should be doing.

Definitely not opposed to some secret sauce to improve things here. Particularly given that:

That would be amazing! Particularly now that I have a very ‘real world’ pipeline/dataset/usecase.

that’s my vibe too… but hey, I’m not a data scientist either!

you mean one more, right? :smiley:
