When you stack up loads of different types of compensation (loudness, pitch, spectral, morphological), it gets to the point where the source becomes arbitrary, and I’m not terribly interested in pushing it that far.
BUT I have found spectral compensation, in the context of corpus query/matching, to be very expressive: even with a small number of samples/options you can still impart some timbral variety from the audio input. I do this presently with melbands, which works OK but gets very phase fucky. And seeing how effective the spectralshape-derived filter sounds, I want to do that in a better way.
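To be concrete about the kind of thing I mean (a very rough numpy/librosa sketch, not my actual patch; the band count, floor value and grain/frame layout are all made up): derive per-band gains from the ratio of the input’s melbands to the matched grain’s melbands, spread them back over FFT bins, and apply them to the grain’s own spectrum so its phase is left alone.

```python
# Melband-derived spectral compensation, minimal sketch (all names illustrative).
import numpy as np
import librosa

def melband_compensation(grain_stft, input_mels, grain_mels,
                         sr=44100, n_fft=1024, floor=1e-6):
    """grain_stft: complex STFT of the matched grain, shape (n_fft//2+1, n_frames).
    input_mels / grain_mels: melband energies for input frame and grain, shape (n_mels,)."""
    gains = (input_mels + floor) / (grain_mels + floor)        # per-melband gain
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=len(gains))
    # spread mel-domain gains back to FFT bins, weighted by the filterbank shapes
    bin_gains = mel_fb.T @ gains / (mel_fb.sum(axis=0) + floor)
    # scale magnitudes only; the grain keeps its own phase
    return grain_stft * bin_gains[:, None]
```

Filtering the grain’s existing spectrum (rather than resynthesising from band energies) is the bit that, in principle, sidesteps the phase mess.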
But back to the core philosophical question: I have been thinking quite a bit lately along similar lines to your oldschool LPT idea, where you have some hierarchical “macro” descriptor types that can each be made up of bespoke deep-learned descriptors, or dimensionally-reduced ones, or whatever, and you can then do any combination of querying with them and transferring over them. So: decoupling querying from transference.
A simple example would be using descriptors that don’t encode loudness (e.g. MFCCs sans the 0th coefficient, or spectral shape) to do your querying/regression/classification/whatever, then, independently of that, choosing to apply loudness compensation to the end result.
A more complex example would be to query with pitch/loudness and morphology (but not timbre), then apply spectral/loudness compensation to the end results.
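Both examples boil down to the same mechanics, something like this (hedged sketch; the column layout, descriptor names and KD-tree lookup are just stand-ins for whatever the real database/query machinery is): you pick which descriptor columns drive the lookup, and separately pick which compensations get applied to whatever comes back.

```python
# Decoupling querying from transference: query on a chosen descriptor subset,
# then apply compensation(s) independently. Layout below is purely illustrative.
import numpy as np
from scipy.spatial import cKDTree

COLS = {"loudness": slice(0, 1), "pitch": slice(1, 2),
        "mfcc": slice(2, 14), "morphology": slice(14, 20)}   # hypothetical layout

def make_query(corpus, query_keys):
    """corpus: (n_grains, n_descriptors). query_keys: which descriptors to match on."""
    idx = np.concatenate([np.arange(*COLS[k].indices(corpus.shape[1]))
                          for k in query_keys])
    return cKDTree(corpus[:, idx]), idx

def match(tree, idx, target_frame):
    _, i = tree.query(target_frame[idx])    # nearest grain on the chosen subset
    return i

def loudness_compensate(grain, target_rms, floor=1e-9):
    """One possible 'transfer' step, applied regardless of what was queried on."""
    return grain * (target_rms / (np.sqrt(np.mean(grain ** 2)) + floor))
```

So the simple example is `make_query(corpus, ["mfcc"])` plus `loudness_compensate`, and the more complex one is `make_query(corpus, ["pitch", "loudness", "morphology"])` plus spectral and loudness compensation on the result.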
It can also open up possibilities for clever treatment of pitch/chroma (à la what you suggest in your “surfing the waves” paper), where you query with pitch, find the best example, then compensate via chroma/octaves/modulo/detuning etc…
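Something in the spirit of (again, just a sketch of the idea, not a claim about how the paper does it): fold the pitch difference between target and matched grain into ± half an octave, so you only ever transpose by the chroma difference and leave the octave where it landed.

```python
# Chroma-aware pitch compensation: modulo-octave fold plus optional detune.
import numpy as np

def chroma_compensation(target_hz, grain_hz, detune_cents=0.0):
    semis = 12.0 * np.log2(target_hz / grain_hz)      # full pitch difference in semitones
    folded = ((semis + 6.0) % 12.0) - 6.0             # fold into [-6, +6): chroma only
    return 2.0 ** ((folded + detune_cents / 100.0) / 12.0)   # transposition ratio
```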
I sent you a quick phone vid of some of the recent stuff I’ve been working on with a buddy; we’re getting really good results with morphology as a low-latency descriptor (i.e. using timescale regressors to predict longer analysis windows/morphology from short ones).
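The shape of that idea, very roughly (not our actual setup; the model choice, feature layout and alignment are all illustrative): train a small regressor to predict what the long-window morphology descriptors would have said from the short-window features you actually have at low latency.

```python
# Timescale-regressor sketch: short-window features in, long-window morphology out.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_timescale_regressor(short_feats, long_morph):
    """short_feats: (n_frames, d_short) short-window descriptors.
    long_morph: (n_frames, d_morph) morphology over a longer window, aligned to
    end at the same frame (i.e. the answer you'd get if you could afford to wait)."""
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    model.fit(short_feats, long_morph)
    return model

# at run time: predicted_morph = model.predict(current_short_frame[None, :])[0]
```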
////////////////////////////////////////////////////////////////////
As to the original question, I don’t think it will be possible to just map things numerically like I was thinking in my previous post, as each compensation would have a huge aggregate effect, likely skewing things way off.