A little bit of an update on this.
Following some super awesome help and thoughts from @jamesbradbury, and after testing some permutations in his FTIS, I decided to try a different approach. All of my testing so far has been primarily quantitative: literally checking how well stuff matched, trying to min/max the best results with what I could manage.
@jamesbradbury set up a thing where it would play the test audio and then the nearest-match audio, so you could hear them back to back. This was useful, so we decided to go all in and implement a @tutschku-style approach, where it plays the target followed by the 4 nearest matches, so you can hear the overall clustering/matching.
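For anyone curious, here's a rough sketch of that kind of audition loop. This isn't the actual FTIS code, just an illustration using scikit-learn and soundfile/sounddevice, assuming precomputed descriptor matrices (numpy, one row per hit) and matching lists of audio file paths:

```python
# Rough sketch of the "play the target, then its 4 nearest matches" audition idea.
# Assumes precomputed descriptor matrices (numpy, one row per hit) and matching
# lists of audio file paths for the training and test hits.
import soundfile as sf
import sounddevice as sd
from sklearn.neighbors import NearestNeighbors

def audition(test_feats, test_files, train_feats, train_files, k=4):
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)

    for feat, target_path in zip(test_feats, test_files):
        _, idx = nn.kneighbors(feat.reshape(1, -1))

        # Play the target, then its k nearest matches, back to back.
        for path in [target_path] + [train_files[i] for i in idx[0]]:
            audio, sr = sf.read(path)
            sd.play(audio, sr)
            sd.wait()
```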
So with his large analysis setup (20 MFCCs, all stats, 1 derivative), things sounded very good: quite solid clustering. And even after some fairly aggressive reduction via UMAP (in Python), the audible matching was still fairly solid. (UMAP is pretty fucking slow though, in Python at least.)
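For reference, the UMAP step is just the umap-learn package in Python, roughly along these lines (the parameter values here are placeholders, not necessarily what was actually used):

```python
# Minimal UMAP reduction sketch (umap-learn). `features` is the
# (num_hits x num_descriptors) matrix from the analysis stage; the
# n_components / n_neighbors values here are placeholders, not the real settings.
import umap

reducer = umap.UMAP(n_components=8, n_neighbors=15, min_dist=0.1)
reduced = reducer.fit_transform(features)  # shape: (num_hits, 8)
```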
So what I’ve been experimenting with today is creating a more ‘real world’ training dataset with hundreds of different hits at different dynamics, with different sticks, various preparations/objects, etc. I hadn’t done this before as I can’t really (easily) verify the results quantitatively, since I would need a corresponding/fixed testing set, which would take forever to make. BUT using this @tutschku approach, I can just create loads of training hits and testing hits and then listen for the clustering.
I re-ran my tests from before, and the results are interesting… Even though I was getting a solid numerical (and audible) match for the nearest match, the overall clustering wasn’t very good.
So I need to go back and try some different permutations to see what gives me the best overall sonic matching/clustering.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Oh, and bumping this thread wouldn’t be fun without some more statistics…
I went back and tried a few permutations I hadn’t tried yet, along with some different descriptors.
Since I got pretty good results with a lower number of natural dimensions, I tried reducing even further and still got solid results with just 19d by taking only the mean of the MFCCs.
20 MFCCs - mean only: 50.6% (19d)
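To be concrete, the mean-only feature is just the per-hit time average of the MFCC frames. Here's a rough sketch using librosa purely for illustration (it's not the actual analysis chain; dropping coefficient 0 is one way to land on 19 values per hit):

```python
# Mean-only MFCC feature per hit: average the MFCC frames over time.
# librosa stands in for the real analysis chain here; dropping coefficient 0
# (which mostly tracks overall level) is one way to end up with 19 values.
import librosa

def mfcc_mean(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc[1:].mean(axis=1)  # drop coeff 0 -> 19d per hit
```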
I then tried including loudness and pitch in the equation, thinking it might be useful to have those for a bit of extra matching on those criteria. Doing 20 MFCCs with mean and std for everything, including loudness and pitch, I got the following:
20 MFCCs + loudness/pitch - mean/std: **54.2%** (42d)
And if I remove the std, I get a very respectable matching accuracy with a low number of dimensions (21d):
20 MFCCs + loudness/pitch - mean: 54.8% / 53.6% / 53.1% (avg. 53.83%) (21d)
I should also say that this is with non-sanitized values (so MIDI for pitch and dB for loudness), which kind of skews the kNN stuff, but this was a quick test to see the potential viability of the approach.
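If anyone wants to sanitize those ranges properly, standardizing each column before the kNN would be the obvious fix. A quick sklearn sketch, assuming combined feature matrices (e.g. 21 columns of MFCC means plus raw pitch/loudness):

```python
# Standardize each column (MFCC means, MIDI pitch, dB loudness) before kNN,
# so no single descriptor dominates the distance purely because of its units.
# `train_feats` / `test_feats` are assumed combined matrices (e.g. 21 columns).
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

scaler = StandardScaler().fit(train_feats)
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(train_feats))
dist, idx = nn.kneighbors(scaler.transform(test_feats))
```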