I’ve been thinking about this again over the last few days, in light of some of the info from @weefuzzy in this thread and some of the comments from @tremblap during the Thursday geek out sessions.
I’m thinking of abandoning the E(nvelope) part altogether, since with the short time frame it isn’t massively descriptive. That being said, some of the clustering from it was alright, since it relied heavily on a mixed collection of means of derivatives. So those may be useful to keep, but perhaps moved over to their respective hierarchical descriptor types.
What I’m also thinking about now is incorporating more vanilla spectral descriptors alongside the MFCCs, as well as lower-order MFCCs, to create a more comprehensive T(imbre) space. I’ve done a tiny bit of testing with this, but manually assembling variations of descriptors/stats takes me a long time, so it’s a bit discouraging to code for an hour and see bad results, code for another hour and see bad results, etc…
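One way around the hour-per-variation grind might be to enumerate the recipes up front and batch-analyze them. A minimal sketch (the descriptor/stat names below are hypothetical placeholders, not actual FluCoMa parameter names):

```python
import itertools

# hypothetical descriptor pools and summary-stat pools to sweep over,
# rather than hand-assembling each variation
timbre_sets = [
    ["mfcc_1-8"],
    ["mfcc_1-13"],
    ["mfcc_1-8", "centroid", "flatness"],
]
stat_sets = [
    ["mean"],
    ["mean", "std"],
    ["mean", "std", "deriv_mean"],
]

# every descriptor-set x stat-set pairing becomes one analysis recipe
recipes = [
    {"descriptors": d, "stats": s}
    for d, s in itertools.product(timbre_sets, stat_sets)
]
# 3 x 3 = 9 recipes to run and compare in one batch
```

Each recipe could then drive one analysis pass, so comparing nine variations costs one coding session instead of nine.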
I’m also rethinking trying to “balance” the number of descriptors per archetype. Timbre is potentially over-represented given the number of spectral moments and MFCCs available, so reducing that down is definitely worthwhile, or eventually doing some of that k-means clustering-as-descriptor thing that @tremblap has talked about. But Loudness, and even more so Pitch, don’t really have that many dimensions that make sense. With my short time frames, I could potentially forgo summary stats for Loudness and just take each frame, potentially alongside std/min/max and derivatives, so that loudness is as comprehensively represented as timbre.
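The per-frame loudness idea might look something like this: keep the raw frames and append a few stats so the loudness side ends up with a dimension count comparable to the timbre side (the frame values below are made up for illustration):

```python
import statistics

# hypothetical per-frame loudness values (dB) from one short analysis window
frames = [-23.1, -21.4, -20.2, -19.8, -20.5, -21.9, -22.7]

# first-order difference as a cheap stand-in for the derivative
deriv = [b - a for a, b in zip(frames, frames[1:])]

# raw frames kept alongside summary stats, so loudness contributes
# roughly as many dimensions as the timbre descriptors do
loudness_vec = (
    frames
    + [statistics.stdev(frames), min(frames), max(frames)]
    + [statistics.mean(deriv)]
)
# 7 raw frames + 3 stats + 1 derivative stat = 11 dimensions
```

Whether the raw frames and the stats should then share one normalization range is a separate question, but at least the dimension counts come out comparable.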
For pitch, however, there’s only really one value that matters…pitch. Confidence is useful for forking or conditional matching (separate conversation), but as a raw descriptor, it’s perhaps better suited to describe timbre.
So unless loudness and timbre can each be boiled down to a single number (and even then, a lot of information and detail would get thrown out), it will be hard to have each aspect equally represented.
For 80% of my purposes pitch will be largely irrelevant, since I don’t have many pitched elements in the input sounds I’m using. There sometimes are, and when there are, I would like them considered, but that can be handled in a different way (biasing etc…).
On that final point, is it viable to distort the space such that you have (as an example) 10d of loudness stuff, 10d of timbre stuff, and 1d of pitch, but with the pitch descriptor scaled up 10x so that it impacts the overall distance more? Or does that just skew everything around in a different way than if you had 10d of pitch information?
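For plain Euclidean distance at least, scaling and duplicating are not the same thing: a dimension scaled by w contributes w² to the squared distance, while the same value duplicated across d unscaled dimensions contributes d. So a 10x scale weighs pitch like 100 duplicated dimensions, and √10 is the scale that would actually match 10 of them. A quick sketch of that arithmetic (toy numbers, assuming a standard Euclidean metric):

```python
import math

def dist(a, b):
    # plain Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two points that differ only in pitch, by 1.0 (in whatever normalized unit)
delta = 1.0

# option A: one pitch dimension scaled up 10x
# squared contribution: (10 * delta)^2 = 100
d_scaled = dist([10.0 * 0.0], [10.0 * delta])

# option B: pitch duplicated across 10 unscaled dimensions
# squared contribution: 10 * delta^2 = 10
d_dup = dist([0.0] * 10, [delta] * 10)

# the scale that matches 10 duplicated dimensions is sqrt(10), not 10
d_match = dist([math.sqrt(10) * 0.0], [math.sqrt(10) * delta])
```

So a 10x scale doesn’t just rebalance pitch against the 10d blocks, it over-weights it by another factor of √10; the square-root relationship is worth keeping in mind when weighting single-dimension descriptors against multi-dimension ones.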