LTE - An alternative to LPT for people who don't care about "pitch"

Now that I've done some PCA speed comparison testing, I want to revisit this and see how it fares. I'm still testing which descriptors/stats are most salient in the first place, and since NaNs shit the bed downstream, I have to keep an eye on which descriptors end up in the general soup. In the process I've come up with something that may be a useful conceptual anchor.

My thinking about this before was to try to come up with meaningful overall descriptors for a sound, given that 1) I don't care about pitch so much and 2) I have a very small window (256 samples) to work with.

I think breaking things up into LTE (loudness/timbre/envelope) still works as the overarching idea, but given some of the discussion on confidence above, I want to sprinkle in a bit of P(itch), primarily to differentiate whether or not something is "pitchy". That said, pitch is nowhere near as significant as the other dimensions.

So at the moment I’m spitballing this:

  • Loudness (4D) - mean, std, min, max
  • Timbre (4D) - loudness-weighted 20 (19, once the 0th is dropped) MFCCs, each summarized by mean, std, min, max (76D) → robust scale → 4D PCA (or MLP)
  • Pitch (2D) - confidence-weighted median, raw confidence
  • Envelope (4D) - deriv of loudness mean, deriv of loudness std, deriv of loudness-weighted centroid mean, deriv of loudness-weighted rolloff mean

So that would give me a 14D space that encompasses the aspects of sound I'm interested in. We'll see how well that works, but I have a hunch (or hope) that having an E(nvelope) vector could be interesting. Plus it incorporates some additional perceptually-meaningful descriptors (centroid/rolloff).
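To make that a bit more concrete, here's a rough offline sketch of how I'd assemble the vector, in Python/numpy/sklearn rather than the actual fluid.bufstats~ / fluid.robustscale~ / fluid.pca~ patch. All the per-frame inputs, the loudness-weighting scheme, and the weighted-median bit are stand-ins to show the shape of the thing, not what the FluCoMa objects do internally:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

def raw_descriptors(loudness, mfccs, pitch, pitch_conf, centroid, rolloff):
    """Per-slice raw stats. Returns the 10D L/P/E part and the 76D timbre part.

    loudness, pitch, pitch_conf, centroid, rolloff: per-frame arrays (n_frames,)
    mfccs: (n_frames, 19), i.e. 20 MFCCs with the 0th dropped
    """
    w = loudness - loudness.min() + 1e-9          # crude loudness weights
    w = w / w.sum()

    # L(oudness) - 4D
    L = [loudness.mean(), loudness.std(), loudness.min(), loudness.max()]

    # T(imbre) - 19 coeffs x 4 stats = 76D, reduced to 4D later (corpus-wide)
    mu = (w[:, None] * mfccs).sum(axis=0)
    sd = np.sqrt((w[:, None] * (mfccs - mu) ** 2).sum(axis=0))
    T76 = np.concatenate([mu, sd, mfccs.min(axis=0), mfccs.max(axis=0)])

    # P(itch) - 2D: confidence-weighted median + overall confidence
    order = np.argsort(pitch)
    cum = np.cumsum(pitch_conf[order])
    p_median = pitch[order][np.searchsorted(cum, 0.5 * cum[-1])]
    P = [p_median, pitch_conf.mean()]

    # E(nvelope) - 4D: stats of the frame-to-frame derivatives
    E = [np.diff(loudness).mean(), np.diff(loudness).std(),
         (w[1:] * np.diff(centroid)).sum() / w[1:].sum(),
         (w[1:] * np.diff(rolloff)).sum() / w[1:].sum()]

    return np.array(L + P + E), T76

def build_space(slices):
    """slices: list of dicts of per-frame arrays keyed like raw_descriptors().
    Returns an (n_slices, 14) array, columns ordered L(4) + P(2) + E(4) + T(4)."""
    rest, timbre = zip(*(raw_descriptors(**s) for s in slices))
    T76 = np.vstack(timbre)
    T4 = PCA(n_components=4).fit_transform(RobustScaler().fit_transform(T76))
    return np.hstack([np.vstack(rest), T4])
```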

I did think about including the individual analysis frames as part of the E(nvelope), since I only have 7 frames of analysis (at most) for my 256-sample window, but that wouldn't scale up. More generic contour descriptors like derivatives (or linear regression, or something even fancier) should transfer from one "shape" to another regardless of the actual/fixed duration.
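As a toy example of what I mean by duration-independent contour descriptors (plain numpy, nothing FluCoMa-specific): fitting against a time axis normalised to 0..1 means the fitted slope describes the overall shape of the track rather than how many frames happened to land in the slice.

```python
import numpy as np

def contour(track):
    """Linear slope (and a quadratic term for 'something even fancier') of a
    per-frame descriptor track over a 0..1 time axis, so a 7-frame slice and a
    700-frame slice produce comparable numbers."""
    t = np.linspace(0.0, 1.0, len(track))
    slope = np.polyfit(t, track, 1)[0]
    curvature = np.polyfit(t, track, 2)[0]
    return slope, curvature
```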

I still want to do that prediction thing where I query for the next 4410 samples, analyzed in a similar way (or perhaps slightly differently), so I end up with two time moments, similar to @tremblap's original approach. I'd also do the same for the entire sample, though I wouldn't be able to use that as an apples-to-apples realtime mapping.

The nitty-gritty of this will unfortunately be a bit tedious, as it will involve a whole load of pruning steps along the way, particularly since all of these stats are non-adjacent, and as @tremblap warns in the Example 11 thread:

If only there was another way…

That being said, I’ll start poking at this and post code/results when I get to the bottom of it.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////

A few questions to end this long necro-bump.

  1. Now that we have MLP, is that better ("generally speaking") than PCA for large reductions (e.g. 76D MFCC space down to 4D T(imbre) space)? (There's a rough sketch of the two pipelines I mean after this list.)
  2. In Example 11, there are normalization steps after most stages of the processing. If everything is going into fluid.robustscale~, is that strictly necessary?
  3. Now that we have fluid.robustscale~, is that better (“generally speaking”) for prepping data for PCA(/MLP) → fluid.kdtree~?
  4. Does the fact that fluid.robustscale~ is median-centered (as opposed to mean-centered) become problematic for PCA (mean-centered) or MLP (-1 to 1 activations)?
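For context on 1/3/4, this is roughly the comparison I have in mind, sketched offline with sklearn rather than fluid.pca~ / fluid.mlpregressor~. The random data, the layer sizes, and the autoencoder setup are all just placeholders for the real 76D MFCC stats:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

X = np.random.randn(2000, 76)               # stand-in for the 76D MFCC stats

# Question 3/4: robust scaling centres on the median and scales by the IQR.
Xs = RobustScaler().fit_transform(X)

# Branch A: straight PCA down to 4D. sklearn's PCA subtracts the mean itself,
# so median-centred input isn't a problem, it just gets re-centred internally.
A = PCA(n_components=4).fit_transform(Xs)

# Branch B (question 1): an MLP autoencoder with a 4D bottleneck,
# 76 -> 16 -> 4 -> 16 -> 76, trained to reconstruct its own input.
ae = MLPRegressor(hidden_layer_sizes=(16, 4, 16), activation="tanh",
                  max_iter=2000).fit(Xs, Xs)

def encode(ae, X):
    """Manual forward pass to the 4D bottleneck (first two hidden layers)."""
    h = np.tanh(X @ ae.coefs_[0] + ae.intercepts_[0])
    return np.tanh(h @ ae.coefs_[1] + ae.intercepts_[1])

B = encode(ae, Xs)                           # the 4D "timbre" space, MLP flavour
```

(The tanh activations are part of why I'm wondering about question 4: robust scaling keeps most values in a small range, but it doesn't hard-clip anything to -1..1.)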