So to kind of bump this here as it’s starting to overlap with some of the explorations I’ve been doing in the time travel thread.
Based on some super helpful input from @jamesbradbury I was starting to do some more qualitative testing where I would play the target, and listen back to the nearest matches, very similar to what @tutschku was doing in this thread.
I went through and pulled out the bit from @tutschku’s patch above that does the Orchidea MFCCs and painfully parsed it into a
fluid.dataset~ (I’m not a big bach user, so this involved pasting things into a spreadsheet app, and dumping things through two
colls etc…) and made a short video comparing the results.
So this is using my tiny analysis windows where for FluCoMa stuff the fft setting are
256 64 512 and for Orchidea it’s just 256 64 as I don’t know how to do oversampling there. Also don’t know how it handles hops or anything like that either, so that might be something to explore/unpack, but even with that, the match accuracy is on par with what @tutschku was getting. So I’m inclined to believe most of that magic is happening in how the MFCCs are being processed, rather than which frames I’m grabbing.
So here’s the video:
I mention it in the video, but this is showing kind of ‘best case’ performance from the FluCoMa stuff. I’ve done loads of different permutations (as can be seen in the other thread), but all of them are roughly in this ballpark. For this video I’m using 20MFCCs, ignoring the first, then adding loudness and pitch. I then take mean, std, min, max for everything, including one derivative. And for the Orchidea I’m taking only the 20MFCCs, nothing else.
So yeah, the results are really striking. Particularly in the overall clustering. This is a testing set of around 800 hits, so big, but not gigantic, and you can really hear how well it handles all 4 neighbors. The FluCoMa stuff does pretty well (though not great) for the nearest match, and maybe two, but the overall “space” that it’s finding isn’t as well defined as with the Orchidea MFCCs.
I’m wondering now if part of the “energy weighted average” that @weefuzzy was wondering about is like what @b.hackbarth has been doing in AudioGuide, where the MFCCs are just weighted by loudness in each frame (perhaps also combined with some standardization across the whole set, rather than per-sample), which is producing better overall results here.
Either way, wanted to bump this as I’m now seeing how accurate/useful this is here, and how to best try to leverage whatever “secret sauce” is going on underneath to get similar results in the