Based on discussions with @weefuzzy, @tremblap, and @tedmoore, both on the forum and in geekouts, there's this idea of trying to boil the data down to a small number of useful and descriptive data points.

Among the techniques discussed so far are using PCA/SVM to determine which descriptors in a dataset account for the most variance, comparing standardized/normalized MFCCs against raw ones (as per @weefuzzy’s suggestion) to see if there is “noise” in the higher coefficients, or just qualitatively poking at/listening to the clustering after each step of plotting.
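To make the first two ideas concrete, here's a rough sketch (in Python/scikit-learn rather than the actual FluCoMa objects, and with randomly generated stand-in data instead of real MFCC analyses) of comparing how PCA ranks variance on raw vs standardized coefficients. The point it illustrates: on raw values, the lower coefficients' bigger magnitudes dominate the variance ranking, while standardizing puts every coefficient on equal footing, so the two runs can give quite different pictures of what's "important."

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an analysis: 200 slices x 13 MFCC coefficients,
# where lower coefficients have much larger magnitudes than higher ones
# (roughly what raw MFCCs look like)
rng = np.random.default_rng(0)
mfccs = rng.normal(size=(200, 13)) * np.linspace(8.0, 0.1, 13)

# PCA on the raw values: the scale differences dominate, so the first few
# components soak up most of the variance regardless of musical relevance
raw_ratio = PCA().fit(mfccs).explained_variance_ratio_

# PCA on standardized values: each coefficient contributes on equal footing,
# so the variance spreads out across components
std_ratio = PCA().fit(StandardScaler().fit_transform(mfccs)).explained_variance_ratio_

print("raw:", raw_ratio[:4].round(3))
print("standardized:", std_ratio[:4].round(3))
```

If the higher coefficients really are mostly noise, standardizing will inflate their apparent contribution, which is one way to sanity-check whether they're worth keeping at all.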

Some of these are quite useful, and others I’m going to play with a bit more, but I want to know if there’s a better (automatic/automagic/programmatic) way to go about verifying the data.

Up to this point, my understanding of the working paradigm has been to “shove a bunch of stuff in” and then let The Algorithm™ (be it PCA, UMAP, MLP, etc…) “find the important stuff” for you. And that has worked up to a point. But now there are differences between numbers of MFCC coefficients, including/grouping/scaling different descriptors, amounts of “noise” being introduced at various steps of the process, etc… that complicate the approach of collecting a ton of stuff and letting it get sorted out.
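One concrete way the scaling/grouping question bites: if descriptors on very different ranges get concatenated without scaling, the big-range one effectively decides what counts as a “nearest” match. A toy sketch (Python stand-in, with hypothetical pitch and loudness columns, not real analysis data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
# Hypothetical concatenated descriptor space: pitch in Hz (range ~50-2000)
# next to loudness in normalized units (range 0-1)
pitch = rng.uniform(50.0, 2000.0, size=(100, 1))
loudness = rng.uniform(0.0, 1.0, size=(100, 1))
X = np.hstack([pitch, loudness])

# Unscaled: pitch swamps the Euclidean distance, loudness barely matters
nn_raw = NearestNeighbors(n_neighbors=5).fit(X)
_, idx_raw = nn_raw.kneighbors(X[:1])

# Scaled: both descriptors actually get a say in what "nearest" means
Xs = MinMaxScaler().fit_transform(X)
nn_scaled = NearestNeighbors(n_neighbors=5).fit(Xs)
_, idx_scaled = nn_scaled.kneighbors(Xs[:1])

print("raw neighbors:", idx_raw[0])
print("scaled neighbors:", idx_scaled[0])
```

The two neighbor lists will generally disagree, which is exactly the kind of silent skew that “collect everything and let it get sorted out” can hide.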

So what are some other approaches and/or workflows for optimizing stuff like this, short of manually testing every permutation, which can be tedious, slow, and ineffective?