Ways to test the validity/usefulness/salience of your data

Good to know.
Yeah, I think the loudness stuff is quite useful to add in, and it isn’t too big a faff if I’m doing “all the descriptors”. My original patch got a lot messier as I was peeking/poking out individual stats and scaling them etc., so it’s much easier just to slap a @weights on a fluid.bufstats~ and call it a day.
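
For anyone wondering what I mean by “weighting” here, this is a rough numpy sketch of the idea only (nothing to do with the actual fluid.bufstats~ internals): quiet frames just contribute less to the summary stats.

```python
import numpy as np

def weighted_stats(frames, weights):
    """frames: (n_frames, n_descriptors), weights: (n_frames,) loudness-ish curve."""
    w = weights / weights.sum()                        # normalise the weights
    mean = (frames * w[:, None]).sum(axis=0)           # weighted mean per descriptor
    var = (w[:, None] * (frames - mean) ** 2).sum(axis=0)
    return mean, np.sqrt(var)                          # weighted mean / std

# e.g. 100 frames of 13 MFCCs, weighted by a per-frame loudness envelope
mfccs = np.random.randn(100, 13)
loudness = np.random.rand(100)
mean, std = weighted_stats(mfccs, loudness)
```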

That’s definitely the medium/long-term plan. In terms of the code I already had in this patch, I was looking at some loudness scaling, but I have experimented with confidence scaling as well. I haven’t yet found an ideal implementation of that, as I suspect a combination of loudness and confidence will suit more of my use cases.
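
Something like this is the kind of combination I have in mind, though the blend, the mix parameter and the dB floor are all made-up knobs for the sake of the sketch, not anything from the FluCoMa objects:

```python
import numpy as np

def combined_weights(loudness_db, confidence, mix=0.5, floor_db=-60.0):
    """loudness_db, confidence: (n_frames,); mix blends the two curves 0..1."""
    loud = np.clip((loudness_db - floor_db) / -floor_db, 0.0, 1.0)  # map dB onto 0..1
    return mix * loud + (1.0 - mix) * confidence                    # simple linear blend

# one weights curve per frame, ready to feed into the weighted-stats stage
weights = combined_weights(np.random.uniform(-70, 0, 100), np.random.rand(100))
```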

I literally have no idea, but I remember that being an important distinction at the time. I think I chatted with @tremblap about it in this thread a while back, when robust scaling had just been implemented and was all the rage.

That’s part of the question, as I’m not entirely sure how best to go about it. If I were just standardizing everything, I could presumably flatten/concatenate everything together and then standardize it all at once?
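
i.e. something like this (sklearn sketch of the “one fit for the whole pile” idea, assuming each corpus has the same descriptor columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

corpus_a = np.random.randn(500, 24)            # stand-ins for real descriptor matrices
corpus_b = np.random.randn(800, 24)

everything = np.vstack([corpus_a, corpus_b])   # flatten/concatenate the corpora
scaler = StandardScaler().fit(everything)      # standardize it all at once

a_std = scaler.transform(corpus_a)             # then apply the shared scaling everywhere
b_std = scaler.transform(corpus_b)
```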

Me too!

I’m still leaning towards a conceptually relevant space, or at least something that isn’t bespoke to each corpus. I guess a medium-term solution would be to run the PCA->UMAP on a bunch of different corpora at the same time and take the columns/scalings it gives me as a “standard” that I would then apply to everything, as a swiss-army-knife of descriptor soups.
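
Roughly this shape of thing, as a sketch only (sklearn + umap-learn, with placeholder sizes for the descriptor count, PCs and UMAP dimensions rather than a recommendation): fit scaling -> PCA -> UMAP on the pooled corpora once, then project any new corpus with those frozen fits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap  # umap-learn

corpora = [np.random.randn(n, 76) for n in (400, 600, 300)]   # a few descriptor soups
pooled = np.vstack(corpora)                                    # fit on everything at once

scaler = StandardScaler().fit(pooled)
pca = PCA(n_components=20).fit(scaler.transform(pooled))
reducer = umap.UMAP(n_components=2).fit(pca.transform(scaler.transform(pooled)))

def project(new_corpus):
    """Map a new corpus into the shared space using the already-fitted models."""
    return reducer.transform(pca.transform(scaler.transform(new_corpus)))

coords = project(np.random.randn(200, 76))                     # reuse the 'standard'
```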