Ways to test the validity/usefulness/salience of your data

Ok, so I’m trying to build a @tedmoore-esque PCA→UMAP pipeline to see how it behaves in a somewhat measurable context.

At the moment I’m taking:

  • all loudness descriptors/stats (1 deriv)
  • 20 loudness-weighted MFCCs with all stats (1 deriv)
  • all loudness-weighted spectralshape descriptors with all stats (1 deriv)
  • loudness-weighted pitch descriptors with all stats (1 deriv)

I’m not 100% confident about some of these choices (e.g. loudness-weighted “confidence” in the mix with pitch, the kajillion MFCC dimensions now, etc…), but it’s a jumping-off point.
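(To make the “all stats with 1 deriv” part concrete, here’s a minimal Python/NumPy sketch of what each branch boils down to before flattening. The stats set, shapes, and numbers are placeholders, not exactly what BufStats spits out — it’s just to show the flattening logic.)

```python
import numpy as np
from scipy import stats as sps

def summary_stats(frames):
    """Summarize a (n_frames, n_descriptors) time series into one flat vector:
    a handful of stats per descriptor, for the raw values and for their
    first-order difference (the "1 deriv" part)."""
    def stats_of(x):
        return np.concatenate([
            x.mean(axis=0),
            x.std(axis=0),
            sps.skew(x, axis=0),
            sps.kurtosis(x, axis=0),
            x.min(axis=0),
            np.median(x, axis=0),
            x.max(axis=0),
        ])
    deriv = np.diff(frames, axis=0)          # frame-to-frame difference as the derivative
    return np.concatenate([stats_of(frames), stats_of(deriv)])

# e.g. 20 MFCC coefficients over 100 analysis frames
# -> 20 coeffs * 7 stats * 2 (raw + deriv) = 280 dims per item
mfcc_frames = np.random.rand(100, 20)        # placeholder for real analysis data
print(summary_stats(mfcc_frames).shape)      # (280,)
```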

Looking back at my old LTEp approach, I applied robust scaling to everything except the MFCCs, which I standardized instead. My workflow was to flatten each branch of the analysis, post-process (robust scale/standardize) the datasets individually, and then concatenate them into a single larger dataset. (That last step was actually really unpleasant to do, so let me know if this is easier now than cascading together a bunch of dummy datasets.)
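(For clarity, here’s roughly what that per-branch scaling and concatenation amounts to, sketched in Python/scikit-learn rather than the actual FluCoMa patch; the branch names and dimension counts are made up.)

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
n_items = 500                                  # one row per analysed slice/sound

# placeholder matrices standing in for the flattened per-branch descriptor stats
branches = {
    "loudness":      rng.normal(size=(n_items, 28)),
    "mfcc":          rng.normal(size=(n_items, 280)),
    "spectralshape": rng.normal(size=(n_items, 196)),
    "pitch":         rng.normal(size=(n_items, 56)),
}

scaled = []
for name, data in branches.items():
    # MFCCs get standardized, everything else robust-scaled, as in the LTEp approach
    scaler = StandardScaler() if name == "mfcc" else RobustScaler()
    scaled.append(scaler.fit_transform(data))

full_dataset = np.hstack(scaled)               # single concatenated dataset
print(full_dataset.shape)
```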

Is this, more-or-less, your workflow (@tedmoore)?

From this step forward I plan on doing the PCA→UMAP thing to see what I get from the whole big mess of soup. Firstly just to browse and compare how this fares against the LTE approach, with its more hand-picked/conceptual descriptor space, and then to try applying the same transformations to tiny analysis windows (256 samples) and larger ones (4410 samples) and see if I can regress between the two.
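(Again in Python terms, the plan is roughly the sketch below — umap-learn/scikit-learn standing in for the actual objects, with all parameters and data shapes being placeholder guesses rather than recommendations.)

```python
import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
full_dataset = rng.normal(size=(500, 560))     # stand-in for the concatenated, scaled stats

# PCA first to soak up the redundancy, then UMAP down to a browsable 2D space
pca = PCA(n_components=0.95)                   # keep ~95% of the variance
reduced = pca.fit_transform(full_dataset)
embedding = umap.UMAP(n_components=2).fit_transform(reduced)

# the 256-sample -> 4410-sample idea: analyse the same slices at both window
# sizes and learn a mapping from the tiny-window features to the big-window ones
small = rng.normal(size=(500, 560))            # stand-in for 256-sample analyses
large = rng.normal(size=(500, 560))            # stand-in for 4410-sample analyses
reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
reg.fit(small, large)                          # predict large-window features from small ones
```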

I do have to say that having native @unit attributes in places makes some of the coding here much easier than before (previously I was unpacking and manually massaging the spectralshape descriptors to get them into the “correct” units), and not having to pull individual columns out by hand also helps. But it’s still not a very pleasant coding experience/workflow to put together an analysis chain like this.