Ways of finding the ideal number of clusters in a corpus

Hello,

I’m working on a patch where I combine corpora ( soundfile + slices )

& then organize the combined slices in clusters according to their mfcc analyses.

My question is, did anybody come up with ways of approximating cluster numbers in a dataset?

Cheers

I don’t think there is such a thing… it is very context-dependent (data and use)… a bit more info on your intended use would help!

Sometimes I plot it in 2D space using UMAP and eyeball it to pick a number.

Usually the number of clusters I choose is based on some project / creative constraint, such as, I have 4 speakers so I’ll choose 4 clusters…


My intent is to slice and analyse archived recordings of percussive instruments, get clusters, then play slices according to their clusters: when I trigger, I pick randomly from within a cluster. I want to have many clusters, each of them containing variations of the same type of sound.

Cheers, the question arose while looking at those 2d umap representations!

I’ve thought about this a bit too, but no concrete/usable ideas (yet).

One would think that with percussive sounds this would be easier (with MFCCs at least) since the sources will be fairly well defined.

Maybe some kind of thing where it tries a bunch of different cluster counts, then compares the cluster mean to the standard deviation and picks the count where the ratio is most favorable?

If you haven’t tried it already, dk.corpusclustermatch will do this for you, though it does require you to explicitly say how many clusters you want. (this video explaining sp.corpusclustermatch shows what I mean).


Cheers, this sounds like a good way to explore, I wouldn’t have thought of that.

I watched most of your videos while learning about FluComa, a great source of inspiration. Also, I’m using Pure Data, not Max, on Linux, so whenever I want to learn about working with percussive material I look into your sp-tools for Pd! Very instructive. I wish you implemented all the sp/dk tools in Pd; I have a tiny laptop with Windows on it so I can check the Max patches, on a tiny screen though.


Let me know how that works. Sadly, computing the mean of a cluster is a lot more awkward than it seems. I’m sure this is not the optimal approach, but I remember struggling with this over in this thread, where I wrote up my process.

I plan on adapting more; I just struggle when it comes to some of the sound-generating things (e.g. polyphonic sample playback) in Pd.

For dk.corpusclustermatch it’s fairly straightforward. The only fussy bits of the patch are the sorting of the clusters by different criteria. Independent of that, it’s just vanilla clustering with the FluComa stuff, then selecting a random item from each cluster.
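For anyone wanting to rebuild that last step in another environment, the cluster-then-pick-randomly part is tiny once the clustering has run. A sketch in Python (the names `group_by_cluster` and `trigger` are made up; `labels` stands in for whatever your clustering fit spits out, one cluster id per slice):

```python
import random
from collections import defaultdict

def group_by_cluster(labels):
    """Map each cluster id to the slice indices assigned to it.
    `labels` is one cluster id per slice, in slice order."""
    clusters = defaultdict(list)
    for slice_index, cluster_id in enumerate(labels):
        clusters[cluster_id].append(slice_index)
    return dict(clusters)

def trigger(clusters, cluster_id, rng=random):
    """Fire one hit: pick a random slice index from the chosen cluster."""
    return rng.choice(clusters[cluster_id])

# e.g. 6 slices assigned to 3 clusters
clusters = group_by_cluster([0, 1, 0, 2, 1, 0])
hit = trigger(clusters, 0)  # one of slices 0, 2, or 5
```

In Pd the same grouping could live in a [text] or an array per cluster, with a random index lookup on each trigger.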


At least this is something I can develop/adapt for you!

I’d be stoked to help! Let me know; that way you won’t waste time on tiny details specific to Pure Data.


That would be super amazing/helpful!

I don’t think I’ll ever make all the objects as it’s like 130+ at the moment, but it would be good to cover all the core functionality.

If you would be down for a chat at some point that would be amazingly useful.

For context, the biggest things I don’t understand in Pd are polyphonic sample playback (à la polybuffer~) and how to load/parse .json files and dictionaries.