Another noob question on scaling non-overlapping corpora

I can offer some of my experiences trying to do similar things.

As @tremblap mentioned, the descriptors (and stats) will have a big impact on what you find, so pick descriptors (and stats) that are conceptually (and musically) meaningful to what you want to differentiate. For example, pitch may be important in your bass/flute, but perhaps quite static in the corpus of pizza plate sounds. In that case you get a bunch of bad/junk info in the corpus, which the matching will try to make sense of; in my experience this leads to a “stuck note”-type sound, since the bad pitch data tends to all come back as a single value (0 or -999 or whatever).
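To make that concrete, here’s a minimal sketch (Python/NumPy, with made-up numbers and a hypothetical -999 pitch sentinel) of screening out junk descriptor columns before they pollute the matching:

```python
import numpy as np

# Hypothetical corpus analysis: rows = slices, columns = descriptor stats.
# Pretend pitch tracking failed on most slices and returned a -999 sentinel.
corpus = np.array([
    [-23.0,   61.2, 12.4],   # loudness, pitch, centroid (made-up values)
    [-18.5, -999.0, 11.1],
    [-30.1, -999.0, 13.0],
    [-25.7, -999.0, 12.2],
])

# Columns that are (nearly) constant, or mostly a failure sentinel,
# carry no information and will skew nearest-neighbor matching.
constant = np.std(corpus, axis=0) < 1e-6
mostly_sentinel = np.mean(corpus == -999.0, axis=0) > 0.5
keep = ~(constant | mostly_sentinel)

clean = corpus[:, keep]   # drop the junk pitch column before fitting anything
print(clean.shape)        # (4, 2)
```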

Similarly, the amount of each type of descriptor/statistic matters too. If you take, say, MFCCs and stats, you’re looking at hundreds of values right there, whereas you may have only 1-8 (depending on settings/stats) for loudness. In a nearest-neighbor matching sense, that makes the loudness almost irrelevant in the sea of MFCC numbers, as the sketch below shows.
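Here’s a rough illustration (Python/NumPy, with random placeholder values standing in for real analyses) of how a single loudness column gets drowned out of a squared-distance calculation by a ~100-value MFCC block:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pair of points: 104 MFCC stats plus 1 loudness value,
# all on a comparable per-dimension scale (random placeholders).
n_mfcc = 104
a = np.concatenate([rng.normal(size=n_mfcc), [-12.0]])
b = np.concatenate([rng.normal(size=n_mfcc), [-14.0]])  # 2 dB quieter

sq = (a - b) ** 2
print(sq[-1] / sq.sum())  # loudness's share of the squared distance
# -> only a few percent: the lone loudness column barely moves the match.
```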

Towards that end, dimensionality reduction can be useful (e.g. making a reduced “timbre” space out of the MFCCs/stats), but this can be a whole rabbit hole in and of itself. There are some experiments/tests in this thread here.
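If you do go down that hole, the basic move looks something like this (a sketch using scikit-learn’s PCA on fake data, just to show the shape of the operation; FluCoMa has its own objects for this kind of thing):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical corpus: 500 slices x 104 MFCC stats (random placeholders).
mfcc_stats = rng.normal(size=(500, 104))

# Collapse the MFCC block into a handful of "timbre" dimensions so it
# no longer outnumbers loudness/pitch in the final feature vector.
pca = PCA(n_components=4)
timbre = pca.fit_transform(mfcc_stats)

print(timbre.shape)                    # (500, 4)
print(pca.explained_variance_ratio_)   # how much variance each dim keeps
```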

A simpler approach at the start is to use just a couple of descriptors/stats (loudness and pitch, say), as those come in roughly similar amounts. And depending on the units you use, they can also be on a similar scale (dB and MIDI pitch, for example, where 1 “unit” is a roughly equal perceptual step in each). That will let you get an analysis/processing pipeline going and test other parameters like playback speed, grain size, and the other things that @tremblap mentions.
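For reference, those unit conversions are cheap to do yourself if your analysis hands you Hz and linear amplitude; a quick sketch (Python, standard formulas):

```python
import numpy as np

def hz_to_midi(hz):
    # MIDI pitch: 69 = A440, 1 unit = 1 semitone.
    return 69 + 12 * np.log2(np.asarray(hz) / 440.0)

def amp_to_db(amp):
    # dBFS: 1 unit is a roughly constant perceptual step in level.
    return 20 * np.log10(np.maximum(np.asarray(amp), 1e-12))

# A 1-semitone move and a 1 dB move now both read as "1 unit",
# so neither descriptor dwarfs the other in a distance metric.
print(hz_to_midi([440.0, 466.16]))   # ~[69.0, 70.0]
print(amp_to_db([0.5, 0.56]))        # ~[-6.0, -5.0]
```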

As a perhaps not immediately useful aside: in my experience I got better (audible) results by doing no scaling of the data spaces at all, even when they were vastly different. I tried robust scaling, standardization, and normalization too. Again, this is likely specific to my approach/descriptors/corpora/input, so take it with a grain of salt. But I did want to say that “getting it working without fancy stuff” is usually a good place to start.

You probably don’t need (or want) a regressor at all for what you’re doing. You can just scale your corpus, create a separate scaling for your input space (and transformpoint individual descriptor frames through it), fit a kdtree with the corpus, and then find the nearest point to each of your scaled inputs.
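That whole flow, sketched in Python with scikit-learn (all data and sizes made up; in FluCoMa you’d be doing the equivalent with the normalize/scale objects’ transformpoint and a kdtree):

```python
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical data: 1000 corpus slices and 200 recorded input analyses,
# each a small descriptor vector (e.g. loudness + pitch + a few timbre dims).
corpus = rng.normal(size=(1000, 6))
input_history = rng.normal(loc=2.0, scale=3.0, size=(200, 6))

# One scaler fit on the corpus, a *separate* one fit on the input space.
corpus_scaler = StandardScaler().fit(corpus)
input_scaler = StandardScaler().fit(input_history)

# Fit the kdtree on the scaled corpus.
tree = KDTree(corpus_scaler.transform(corpus))

# For each live input frame: scale it through the input scaler
# (the transformpoint step), then query the nearest corpus point.
live_point = rng.normal(loc=2.0, scale=3.0, size=(1, 6))
dist, idx = tree.query(input_scaler.transform(live_point), k=1)
print(int(idx[0][0]))   # index of the corpus slice to play back
```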