I’m not sure if this is expected behavior or not, but it feels like something is breaking, and perhaps it should report back in a different way.
I’ve found that when the number of clusters is set fairly high (the exact point varies per dataset and per fitpredict, since it’s not deterministic), rather than breaking things into more clusters, fluid.kmeans~ instead reverts to 0 clusters:
Here I’ve asked for 50 clusters, which is obviously too many, but rather than finding a bunch of clusters it has made just a single one (see the printed labels in the console).
If I spam fitpredict a few more times, it then changes to 4 clusters:
So it feels like there’s some kind of over/underflow here, and when this happens it should probably report back “not enough clusters found” or something like that, rather than returning a single cluster, or 4 (the default?) clusters.
The actual use case isn’t as ridiculous as this, but it’s easy enough to demo with the fluid.kmeans~ help patch. In my SP-Tools patch this can happen if I set clusters to 10 or 11.
I think I know exactly why this is happening and there will be an added option for kmeans in due course that will make it much less likely.
(The non-deterministic part of k-means is how the cluster centroids are initialized. Currently the object uses an initialization scheme called random partition. Sometimes this can fail with empty clusters, especially on data distributions that are pathological for it. Work is already partly done to offer some different initialization options).
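To make that failure mode concrete, here is a rough numpy sketch of random-partition initialization and how it can leave most clusters empty after a single assignment step. This is purely illustrative of the general idea, not FluCoMa’s actual implementation, and the dataset sizes are made up.

```python
# Illustrative sketch only (not FluCoMa's code): why "random partition"
# initialization can leave most clusters empty when k is high for the data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))   # a small dataset: 60 points in 2 dimensions
k = 50                         # far more clusters than the data supports

# Random partition: every point gets a random label, and each partition's mean
# becomes an initial centroid. Partitions here are tiny (or empty), and the
# averaging pulls centroids towards the middle of the data, so many of them
# end up nearly coincident or duplicated.
labels = rng.integers(0, k, size=len(X))
centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else X.mean(axis=0)   # empty partition: fall back to global mean
                      for j in range(k)])

# First assignment step: each point goes to its nearest centroid. A handful of
# centroids capture everything and the remaining clusters come out empty.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
assigned = dists.argmin(axis=1)
print("non-empty clusters after one step:", len(np.unique(assigned)), "of", k)
```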
As to how to handle empty clusters: I think sklearn actually tries to fix them as it goes, but I’ll need to check again. Failing that, I guess we could look at adding a warning, or implementing some kind of cluster quality metric you could query the object for.
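For a sense of what such a quality metric could look like, here is a small sketch using scikit-learn’s silhouette score. Whether FluCoMa would expose this particular metric is an open question, and the degenerate-case handling here is just one possible choice.

```python
# Sketch of a cluster quality check using silhouette score (one candidate
# metric; not necessarily what the object would actually expose).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_quality(X, labels):
    """Return a score in [-1, 1]; treat a single-cluster result as worst case."""
    if len(np.unique(labels)) < 2:
        return -1.0   # everything collapsed into one cluster
    return silhouette_score(X, labels)

X = np.random.default_rng(1).normal(size=(100, 4))
labels = KMeans(n_clusters=10, n_init=1).fit_predict(X)
print("quality:", cluster_quality(X, labels))
```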
That’s good to know, and I guess it makes sense that it can sometimes fail from random initialization.
I don’t know if what you’re cooking up takes that into account, but it would be great if, when it fails or finds “poor quality” clusters, it could just iterate (internally) until it finds something that passes a quality metric (and throw up an error if it never does).
In general, I’ve found that it can sometimes initialize poorly even with a small number of clusters, depending on the distribution of samples and the random seed. (I have an example with clear kick/snare/hat samples that works 99% of the time, but sometimes it breaks them up weirdly and I have to load it up again).
So having a built-in metric that checks the result makes sense, and tries again if it rolls badly, would be super useful.
That’s unlikely, because assessing cluster quality is a distinct step. So it’s more likely we’d add a message to assess it, and leave any possible iteration up to the user. FWIW, the other initialization methods would possibly obviate the need, except in the most drastic of cases. One of them is much more reliable, albeit at the cost of some extra computation.
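If a quality query does get added, the user-side iteration could look something like the sketch below. It’s written in Python with scikit-learn standing in for fitpredict just to show the control flow; in practice this would be a loop in the Max patch, and the threshold and retry budget are arbitrary placeholders.

```python
# Sketch of user-side retry: re-fit until a quality threshold is met or a
# retry budget runs out (threshold and max_tries are made-up values).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fit_until_good(X, k, threshold=0.2, max_tries=10, seed=0):
    rng = np.random.default_rng(seed)
    for attempt in range(max_tries):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        if len(np.unique(labels)) == k:           # no empty clusters...
            score = silhouette_score(X, labels)
            if score >= threshold:                # ...and acceptable quality
                return labels, score, attempt + 1
    raise RuntimeError(f"no clustering passed the quality check in {max_tries} tries")
```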
That sounds good. I’d sooner not have to manually faff/iterate things, and for the most part the kmeans happens “instantly” (as far as I can tell), so even if it takes 10 times as long, that’d still be “instant”.