Using KMeans clustering as "dimensionality reduction"

I remember this coming up during the Thursday geek-out sessions, primarily between @tremblap and @tedmoore, but PA mentioned it again in the LTE thread.

So the idea, if I understand it correctly, would be to use KMeans clustering to find 100 clusters in a dataset of MFCCs+stats such that each cluster would represent a “unit” of Timbre.
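To make the idea concrete, here is a minimal sketch in Python/scikit-learn (rather than the FluCoMa objects) of clustering an MFCC+stats matrix and treating the cluster index as a single "timbre unit" per point. The data here is random stand-in stuff and the names are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
mfcc_stats = rng.normal(size=(5000, 104))   # stand-in for real MFCC+stats analyses

# 100 clusters, each intended to act as one "unit" of Timbre
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc_stats)
timbre_unit = kmeans.labels_                # one integer (0-99) per point
print(timbre_unit[:10])
```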

This seems like a great idea, but a few things occurred to me which I thought might be worthwhile discussion.

  • The clusters would have no perceptual ordering to them at all. So even though there may be 100 unique timbres, the difference between cluster1 and cluster2 may be gigantic, since (I believe) the labelling is arbitrary. I guess you could apply some perceptual-descriptor stuff (centroid/flatness/whatever) to the clusters and sort them that way, but that seems like it would re-introduce some of the maths problems of centroids that using MFCCs largely mitigates (e.g. a “scooped” timbre returning a central centroid value).
  • The Timbre space would now be inherently “quantized”, which shouldn’t matter too much if your other dimensions (Pitch/Loudness) are still continuous, but it still feels like you’re throwing away a lot of data, particularly if you have a (very) large dataset. (i.e. the differences in timbre between points inside cluster5 may be quite significant, but invisible to the reduction if you just ask for “100 clusters”.)
  • What do you do if you have fewer than 100 or more than 10000 points? On the smaller end (my initial concern), do you just end up pulling the whole Timbre dimension “down” if it only contains, say, 20 clusters (so 1-20) versus 100 units for the other dimensions (dB/MIDI)? And if you have a lot of points, you end up quantizing things heavily.
  • Perhaps you could ask for a % of your total number of datapoints, then scale the result so those clusters fall on a range of 0-100 (or whatever your “overall range” is). So if you have only 20 clusters it would go through the equivalent of scale 1 20 0. 100., and similarly if you end up with 500 clusters you get scale 1 500 0. 100. (see the sketch after this list).
  • This process would, in effect, reduce Timbre to a single number/dimension. That’s pretty cool. BUT other dimensions benefit from having more dimensions (a few stats for Loudness and Pitch). I guess you can be quite brutal and prune Loudness down to the single Loudness value, and Pitch down to a thresh’d Pitch value, but that also throws some baby out with the bathwater.
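As a hedged sketch of that rescaling idea, here is a hypothetical helper (same Python register as above, 0-indexed clusters rather than the 1-indexed Max scale example):

```python
# Hypothetical helper: rescale a cluster index (0..n_clusters-1) onto a fixed
# output range, the equivalent of the [scale 1 20 0. 100.] idea above.
def scale_cluster_index(index: int, n_clusters: int,
                        out_min: float = 0.0, out_max: float = 100.0) -> float:
    if n_clusters < 2:
        return out_min
    return out_min + (index / (n_clusters - 1)) * (out_max - out_min)

print(scale_cluster_index(0, 20))      # 0.0
print(scale_cluster_index(19, 20))     # 100.0
print(scale_cluster_index(250, 500))   # ~50.1
```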

So yeah, just wanted to jot some ideas down while it was fresh in my head. I’m sure there’s plenty more to think about with this approach.

Yes this is what I explained in our meetings. But there is a solution: you can use the centroids of each cluster and rank them via PCA0. This is where I’m going when I have time to test this.

Why via PCA, vs just zl sort or something like that? Wouldn’t the idea be to have a sense of more-ness and less-ness to the trajectory?

The idea is to sort them, but these centroids are multidimensional… so how do you sort high-dimensional stuff? Taking a shot at PCA0 and hoping it works will be my first approach. Better than the random order kmeans will give me.

Ahh right. So not necessarily darker→brighter, but an additional layer of dimensionality “reduction” on top of everything else.

Each class ‘found’ by kmeans will have a multidimensional centroid. If you want your ordering of 100 timbral classes to make sense, as you pointed out above and in previous discussions too, you need to ‘rank’ them. So PCA0 will be my first choice.

Ooooh. So you would presumably rank each overall cluster (cluster1, cluster2, cluster3, etc…) by its centroid. And then within each cluster you would also rank the points, so inside cluster1 you’d have cluster1a, cluster1b, cluster1c, etc…

Is that it?

Not in this case. If you dump the kmeans, you’ll see each class has a multidimensional entry, which is the centroid of the points in that class. If you rank those, then you have the 100 points. For now, that quantized space is enough (considering you are very likely not to have that many points per class to train with…)
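A minimal sketch of that ranking step, again in Python/scikit-learn with stand-in data: pull the per-cluster centroids out of the fitted kmeans, project them onto the first principal component (“PCA0”), and use that projection to re-order the otherwise arbitrary cluster labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mfcc_stats = rng.normal(size=(5000, 104))             # stand-in MFCC+stats data

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc_stats)
centroids = kmeans.cluster_centers_                   # the "dumped" entries, shape (100, 104)

# project the 100 multidimensional centroids onto PCA's first component
pca0 = PCA(n_components=1).fit_transform(centroids)[:, 0]
order = np.argsort(pca0)                              # cluster ids, low to high along PCA0
rank_of_cluster = np.empty_like(order)
rank_of_cluster[order] = np.arange(len(order))        # arbitrary label -> rank 0-99

timbre_rank = rank_of_cluster[kmeans.labels_]         # one ordered "timbre" value per point
```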


Ooooooh. I got you. Not a separate “spectral centroid” audio descriptor, but rather the statistical centroid of the numbers in the dataset…