Hey there! I'm working on something similar right now, though I can't confirm yet whether it works. My dataset consists mostly of short sounds (short scratching gestures on various objects), and here is the list of descriptors I'm currently trying:
- length
- loudness stats
- attack loudness (first 100 ms)
- attack strength (attack loudness / mean total loudness)
- spectral shape stats
- grain density (graincount / length)
- spectral grain density (spectral graincount / length)
- transient density (transientcount / length)
- tonal strength (mean loudness of harmonic component / mean loudness of percussive component, via HPSS)
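For the loudness-based descriptors, the idea is roughly this (a numpy sketch, not the actual FluCoMa objects; plain RMS stands in for a proper loudness model, and the frame/hop sizes are arbitrary choices):

```python
import numpy as np

def loudness_descriptors(y, sr, frame=1024, hop=512):
    """Rough sketch: length, loudness stats, attack loudness, attack strength."""
    # frame-wise RMS as a crude stand-in for a perceptual loudness curve
    n = 1 + max(0, len(y) - frame) // hop
    rms = np.array([np.sqrt(np.mean(y[i * hop:i * hop + frame] ** 2))
                    for i in range(n)])
    length = len(y) / sr                          # duration in seconds
    attack_frames = max(1, int(0.1 * sr / hop))   # frames covering the first ~100 ms
    attack_loudness = rms[:attack_frames].mean()
    return {
        "length": length,
        "loud_mean": rms.mean(),
        "loud_std": rms.std(),
        "attack_loudness": attack_loudness,
        # attack loudness relative to the mean loudness of the whole sound
        "attack_strength": attack_loudness / (rms.mean() + 1e-12),
    }
```

For a percussive, decaying sound the attack strength comes out above 1, which is the kind of separation I'm hoping for between scratchy/impulsive and sustained gestures.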
The idea with the "grain density" descriptors is to use ampslice, noveltyslice, and transientslice to get an idea of the grittiness or granularity of the sound (and maybe some vague spectral morphology). It might be BS; I have to test and see. Once I have the feature set I UMAP it down to 3 dimensions, and at the moment it looks like this. There will be some spatial granular synthesis involved, which is why I wanted the 3D.
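The density part is just "count the slice points, divide by duration". A crude ampslice-like stand-in in numpy (threshold crossings of an amplitude envelope; the hop size and threshold are arbitrary, and real ampslice is smarter about relative thresholds and debouncing):

```python
import numpy as np

def onset_density(y, sr, hop=256, thresh_db=-30.0):
    """Onsets per second, counted as upward crossings of an envelope threshold."""
    n = len(y) // hop
    # per-frame peak amplitude envelope
    env = np.array([np.abs(y[i * hop:(i + 1) * hop]).max() for i in range(n)])
    env_db = 20 * np.log10(env + 1e-12)
    above = env_db > thresh_db
    onsets = np.count_nonzero(above[1:] & ~above[:-1])  # rising edges only
    return onsets / (len(y) / sr)
```

Swapping the envelope for a novelty curve or a transient detector gives the spectral and transient variants of the same number.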
I have never managed to build a good intuition for MFCCs, so I've been avoiding them... I tried using them as general descriptors earlier, but it always turned out that I could get the same result (at least on my dataset) with far fewer, more targeted descriptors (like loudness, centroid, or flatness). Then again, maybe a high-res MFCC + UMAP combo would make a lot of sense for a "general purpose" application.
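By "targeted descriptors" I mean things that are cheap to compute and easy to interpret. For example, centroid and flatness over a whole short sound are just a couple of lines (numpy sketch over a single FFT of the whole file, which only really makes sense for very short sounds like mine):

```python
import numpy as np

def centroid_flatness(y, sr):
    """Spectral centroid (Hz) and spectral flatness of a short sound."""
    mag = np.abs(np.fft.rfft(y * np.hanning(len(y))))
    freqs = np.fft.rfftfreq(len(y), 1 / sr)
    # centroid: magnitude-weighted mean frequency
    centroid = (freqs * mag).sum() / (mag.sum() + 1e-12)
    # flatness: geometric mean / arithmetic mean (near 1 = noisy, near 0 = tonal)
    flatness = np.exp(np.mean(np.log(mag + 1e-12))) / (mag.mean() + 1e-12)
    return centroid, flatness
```

A sine wave lands near its frequency with flatness near 0, white noise lands high on flatness, and that separation alone already does a lot of the work MFCCs were doing for me.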