So the idea, if I understand it correctly, would be to use KMeans clustering to find 100 clusters in a dataset of MFCCs+stats such that each cluster would represent a “unit” of Timbre.
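To make sure I'm picturing the same thing, here's a minimal sketch of that clustering step in plain Python. It's a toy version of Lloyd's algorithm, not what any particular library does internally, and the names (`kmeans`, `nearest`, `points`) are made up for illustration. The 2-D points stand in for real MFCCs+stats vectors, and `k=2` stands in for the 100 clusters.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centres):
    """Index of the centre closest to the given point."""
    return min(range(len(centres)),
               key=lambda i: squared_distance(point, centres[i]))

def kmeans(points, k, iterations=50, seed=0):
    """Toy k-means: returns (labels, centres) for a list of vectors."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points (an arbitrary choice).
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = [nearest(p, centres) for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # leave a centre alone if nothing landed on it
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centres.
    labels = [nearest(p, centres) for p in points]
    return labels, centres

# Stand-in data: two obvious "timbres" (real input would be MFCCs+stats).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centres = kmeans(points, k=2)
```

Each entry in `labels` is then the single "timbre unit" number for that point. Which group gets which number depends on the random starting centres.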
This seems like a great idea, but a few things occurred to me that I thought might be worth discussing.
- The clusters would have no perceptual ordering to them at all. So even though there may be 100 unique timbres, the difference between `cluster1` and `cluster2` may be gigantic, as (I believe) the numbering is arbitrary. I guess you can apply some perceptual-descriptor stuff (centroid/flatness/whatever) to the clusters and sort them that way, but that seems like it would re-introduce some of the maths problems of centroids that using MFCCs can largely mitigate (e.g. having a “scooped” timbre returning a central centroid value).
- The Timbre space would be inherently “quantized” now, which shouldn’t matter too much if your other dimensions (Pitch/Loudness) are still finely resolved, but it still feels like you’re throwing away a lot of data, particularly if you have a (very) large dataset (i.e. the differences in timbre between points inside `cluster5` may be quite significant, but invisible to the reduction if you just ask for “100 clusters”).
- What do you do if you have <100 or >10000 points? On the smaller end (my initial concern), do you just end up pulling the whole Timbre dimension “down” if it only contains, say, 20 clusters (so 1-20) vs 100 of the other units (dB/MIDI)? Or, if you have a lot of points, you can end up quantizing things.
- Perhaps you could ask for a number of clusters that is a % of your number of datapoints, then scale the range so that those clusters fall on a scale of 0-100 (or whatever your “overall range” is). So if you have only 20 clusters, it would then go through an equivalent of `scale 1 20 0. 100.`, and similarly if you end up with 500 clusters you get `scale 1 500 0. 100.`.
- This process would, in effect, reduce Timbre to a single number/dimension. That’s pretty cool. BUT the other descriptors benefit from having more than one dimension (a few stats each for Loudness and Pitch). I guess you can be quite brutal and prune Loudness down to the single Loudness value, and Pitch down to a thresh’d Pitch value, but that also throws some baby out with the bathwater.
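
A couple of the points above (sorting clusters by a descriptor, and picking the cluster count as a % of the dataset before rescaling to 0-100) can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, the 5% fraction is an arbitrary placeholder rather than a recommendation, and `scale` is a plain linear map meant to behave like the `scale 1 20 0. 100.` example, not a full copy of the Max object.

```python
def cluster_count(n_points, fraction=0.05, minimum=2):
    """Pick k as a fraction of the dataset size (fraction is an assumption)."""
    return max(minimum, round(n_points * fraction))

def scale(value, in_low, in_high, out_low=0.0, out_high=100.0):
    """Linear map from [in_low, in_high] to [out_low, out_high].

    Needs in_high != in_low, i.e. at least two clusters.
    """
    return out_low + (value - in_low) * (out_high - out_low) / (in_high - in_low)

def rank_clusters(descriptor_per_cluster):
    """Map each arbitrary cluster label to a 1-based rank.

    descriptor_per_cluster[i] is one number describing cluster i
    (e.g. a mean spectral centroid); lower values get lower ranks.
    """
    order = sorted(range(len(descriptor_per_cluster)),
                   key=lambda i: descriptor_per_cluster[i])
    return {old_label: rank for rank, old_label in enumerate(order, start=1)}

# A small dataset and a large one end up on the same 0-100 range.
small_k = cluster_count(400)      # 5% of 400 points
large_k = cluster_count(10000)    # 5% of 10000 points
lowest = scale(1, 1, small_k)
highest_small = scale(small_k, 1, small_k)
highest_large = scale(large_k, 1, large_k)

# Three clusters with made-up descriptor values, relabelled in sorted order.
ranks = rank_clusters([0.9, 0.1, 0.5])
```

The ranking step only gives an ordering along whichever single descriptor you pick, so it inherits that descriptor's weaknesses (the “scooped” timbre problem mentioned above), and the rescaling does nothing about the quantization inside each cluster.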
So yeah, just wanted to jot some ideas down while they were fresh in my head. I’m sure there’s plenty more to think about with this approach.