Making a well-sampled MFCC space for audio querying

staysh · December 31, 2023, 5:51pm

Forgive the bumping of this thread; it came up twice in my searches for info here, and the emphasis of the title is on the data, so I feel this is related…

Can anyone point me to some resources on corpus building? More specifically, I want to take a bunch of audio and reduce redundancy via its 7-15 dimensional representation (MFCC). I’m working with FluCoMa in Max/MSP, but I’m comfortable in any language/environment if there are other workflows for this.

In 2 dimensions I could perceive the coverage of the feature space, but not in 13.

A bit more about my use case: I am slicing audio by novelty in MFCC analysis, then doing KD-tree lookups via MFCC analysis of live audio. I would like to make the corpus as small and potent as possible. For instance, I'd load new audio as I collect it, incorporate slices from it where there are gaps in the corpus, and thin out the denser clusters. Real-time lookup performance is fine on 13 dimensions, with well over 10,000 rows in the dataset.

I realize there is also the input space to think about, and that it could be modeled instead, to map which corpus locations a given source can actually reach. (What good would it be to have data that a given input can't get close enough to?)

I’m looking for legibility, meaning, for me, slices of input mapping closely to slices in the corpus. I figured having a highly curated, generalized corpus (at least in terms of its MFCC analysis) is one strategy.
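One rough way to probe the "input space" question is to run a batch of input frames against the corpus and count which corpus rows are ever returned as the nearest match. The sketch below is only an illustration: it uses random vectors in place of real MFCC frames, a brute-force search in place of a KD-tree (the nearest row is the same either way), and the names `nearest_index` and `corpus_usage` are made up for this example.

```python
import math
import random

def nearest_index(query, corpus):
    """Brute-force nearest neighbour (stands in for a KD-tree lookup)."""
    return min(range(len(corpus)), key=lambda i: math.dist(query, corpus[i]))

def corpus_usage(inputs, corpus):
    """Count how often each corpus row is the nearest match to an input frame."""
    hits = [0] * len(corpus)
    for frame in inputs:
        hits[nearest_index(frame, corpus)] += 1
    return hits

random.seed(0)
DIMS = 13  # assumed number of MFCC coefficients per slice

# Placeholder data: a broad corpus and a narrower, shifted input source.
corpus = [[random.gauss(0.0, 1.0) for _ in range(DIMS)] for _ in range(200)]
inputs = [[random.gauss(0.5, 0.3) for _ in range(DIMS)] for _ in range(500)]

hits = corpus_usage(inputs, corpus)
unreached = [i for i, h in enumerate(hits) if h == 0]
print(f"{len(unreached)} of {len(corpus)} corpus rows were never selected")
```

Rows that are never selected for a representative input are candidates for pruning, at least for that particular source.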

rodrigo.constanzo · December 31, 2023, 7:13pm

I’ve not watched all of the new videos, but @tedmoore did an absolutely massive dump of workshop videos a few days ago which may be of interest. These in particular:

(p.s. @tedmoore the thumbnail of this gives me Max anxiety…)

There’s also the classic vid by @jamesbradbury:

It’s been my experience that dimensionality has essentially no impact on speed. A KDTree is fast/optimized such that whether you’re looking through 13d or 250d data, it makes little difference.

This is, ultimately, where the secret sauce lies: finding the descriptor space that makes the most sense for what you want to display/browse/represent, be it a “simple” 2d thing à la CataRT (loudness/centroid), a more complex/processed MFCC space, or a blend of things (LTP, LTEp).

tedmoore · December 31, 2023, 7:54pm

In my experience @rodrigo.constanzo is right. Also @rodrigo.constanzo has some great videos, many of which you can find on this discourse, and posts on this discourse that model the strategy of “knowing your data” by doing a lot of plotting, listening, and tweaking to really know what the audio analyses provide, what you care about as a listener/composer, and how to connect those two realities.

MFCCs tend to be a good general-purpose starting point. Also, it’s often not too much effort to do more analyses (spectral shape, pitch, loudness, etc.), then plot, listen, etc. and see what makes sense for you.

Something like this might get you going in that direction:

staysh · December 31, 2023, 11:10pm

Thanks for these responses.

Am I following correctly that clustering can find dense areas, and that I could allow new slices only if they are either outside the densest clusters or a certain distance away from what is already there? I could then audition the clusters and start thinning/pruning by isolating and inspecting those groups.
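The "a certain distance away" idea can be sketched without clustering at all: keep a slice only if its nearest neighbour in the corpus is farther than some threshold. This is a minimal sketch, assuming Euclidean distance on the MFCC vectors and a hand-picked `min_dist`; the function names are hypothetical, and the brute-force search would be replaced by a KD-tree query on a real corpus. The toy data is 2-d only so the result is easy to check by eye.

```python
import math

def nearest_distance(point, corpus):
    """Distance from point to its closest corpus row (inf if corpus is empty)."""
    return min((math.dist(point, row) for row in corpus), default=math.inf)

def admit_if_novel(candidates, corpus, min_dist):
    """Return corpus plus every candidate at least min_dist from all kept rows."""
    kept = list(corpus)
    for candidate in candidates:
        if nearest_distance(candidate, kept) >= min_dist:
            kept.append(candidate)
    return kept

def thin(corpus, min_dist):
    """Greedy thinning: rebuild the corpus, dropping rows that crowd earlier ones."""
    return admit_if_novel(corpus, [], min_dist)

# Toy 2-d example: two near-duplicates and one isolated point.
corpus = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
thinned = thin(corpus, min_dist=1.0)

# One candidate sits in an occupied region, the other fills a gap.
grown = admit_if_novel([[0.2, 0.1], [10.0, 0.0]], thinned, min_dist=1.0)
print(thinned)
print(grown)
```

The greedy pass depends on row order (earlier rows win), so auditioning the rejected slices before discarding them is probably still worthwhile.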

Second question, and I’m really just winging this as I haven’t even done any more preliminary thinking as to the implications, but has anyone messed around with quantizing 13d space? I can’t really imagine how this works other than how it would in three dimensions, “does this cube have any points in it?”

…but it seems like the number of regions gets really big really quickly… I guess a binary split on each “axis” would be manageable; then I’m only working with 2^13 = 8192 “regions” that have hits/misses. Is this viable?
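For anyone wanting to try the binary-split idea, here is a minimal sketch. It assumes each axis is split at its median and packs the 13 above/below decisions into one integer cell index; the random data is a stand-in for real MFCC slices, and `cell_index` is a hypothetical helper name.

```python
import random
from collections import Counter
from statistics import median

def cell_index(point, thresholds):
    """Pack one above/below-threshold bit per dimension into an integer."""
    index = 0
    for value, threshold in zip(point, thresholds):
        index = (index << 1) | int(value > threshold)
    return index

random.seed(1)
DIMS = 13  # assumed number of MFCC coefficients per slice

# Placeholder data: 10,000 independent Gaussian rows instead of real slices.
points = [[random.gauss(0.0, 1.0) for _ in range(DIMS)] for _ in range(10_000)]

# Split every axis at its median so each half holds about half the rows.
thresholds = [median(column) for column in zip(*points)]

occupancy = Counter(cell_index(p, thresholds) for p in points)
print(f"{len(occupancy)} of {2 ** DIMS} cells occupied")
print("most crowded cells:", occupancy.most_common(3))
```

With roughly 10,000 rows spread over 8192 cells, the average is barely more than one slice per cell, so hit/miss counts at this resolution say little about density. Real MFCC dimensions are also correlated, so occupancy on actual data may look quite different from this random stand-in.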

tedmoore · January 1, 2024, 10:05pm

This is something that you could do, yes.

My gut says that “quantizing 13-d space” wouldn’t bear much fruit. What are you imagining the benefit would be? A KDTree or KMeans would make more sense for a way to find relationships in 13-d space.

With 13+ dimensions, you’re starting to run into the “curse of dimensionality”, so however carefully you strategize about how your 13-dimensional space would “work”, it is probably not going to behave the way you might hope or guess (this goes for quantizing, but also for anything distance-based, such as KDTree or KMeans too!).
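One symptom of the curse of dimensionality is that nearest and farthest neighbours end up at increasingly similar distances. The sketch below illustrates this with uniformly random points, which is an assumption made for simplicity; real MFCC data is not uniform, so the effect there may be weaker or stronger. The function name is made up for this example.

```python
import math
import random

def nearest_to_farthest_ratio(dims, n_points=1000, seed=0):
    """Ratio of the closest to the farthest distance from one random query.

    A ratio near 0 means the nearest neighbour is much closer than the
    farthest; a ratio near 1 means every point is about equally far away.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dims)]
    distances = [math.dist(query, p) for p in points]
    return min(distances) / max(distances)

for dims in (2, 13, 100):
    print(dims, round(nearest_to_farthest_ratio(dims), 3))
```

The ratio grows with the number of dimensions, which is one reason a "nearest" match in 13-d can be less meaningfully near than intuition from 2-d plots suggests.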

If you do pursue it, I’ll be curious to hear how it goes!

tremblap · January 5, 2024, 11:46am

Hello @staysh and welcome!

I’ve made a new thread with this fascinating question. Feel free to rename the thread, but I thought it was a great discussion in its own right.

p