LTE - An alternative to LPT for people who don't care about "pitch"

Initially posted this on Slack, but wanted to move the conversation (thinking out loud) over onto Discourse.

What @groma mentioned about the usefulness of higher coefficient count MFCCs to represent spectrum (and almost pitch, if I understood correctly) has me thinking about what would be a useful analog to @tremblap’s LPT paradigm, but for drums/percussion. Or more specifically, a dimensional mapping scheme where “pitch” is not as significant a component as timbre or loudness, and matters only in as much as it is a subset of timbre.

For all the stuff I’ve been doing lately, I’ve completely left pitch descriptors out, though for initially pragmatic reasons. 256 samples isn’t big enough to get useful pitch information (or at least for the pitch range I’m working with on tuned drums).

But if I’m trying to streamline and get a descriptor space that is conceptually meaningful and perceptually bounded, how do I balance those things together?

So like an “LT” space. Or maybe, for drums/percussion-related sounds, envelope and morphology is as important as loudness and timbre.

Either way, thinking out loud here and how to put together a multi-dimensional space that 1) isn’t so pitch-centric and 2) fits what I’m trying to do better.

That’s a slightly edited copy/paste of what I initially posted on Slack, but I wanted to move as much of the context over into this post as possible.

In thinking on this further, individual descriptors such as loudness and timbre are quite meaningful, but generally are about a fixed point in time (or rather, a window in time). And although we normally take some kind of (unweighted(!)) mean of that period as a kind of “summary”, this speaks very little about morphology, or time.

The AudioGuide solution to this is to take individual analysis frames, so there isn’t a “summary” of time as such, and then compare frame-by-frame what you’re looking for. I like that idea, but given my context (short analysis window/latency with long files) is a luxury I often don’t have.

Which is what led me to think of “E” (envelope) as perhaps an equally meaningful descriptor. I suppose that a better solution might be to have a morphology for each (macro)descriptor. So having loudness AND envelope of loudness, but if I’m thinking of a low dimensional space, is the envelope of loudness the same importance as loudness? Perhaps it is. I don’t know.

But mainly wanting to spitball and discuss here what may be a good way to have perceptually meaningful descriptors which have equal (conceptual and perceptual) weight in the overall scheme of things.

LPT is a good paradigm, I think, but in my case it would probably be more significant to have a differentiation between something periodic/pitchy and aperiodic/noisy(or whatever), rather than to care whether something is an F or a G. Like, in the overall perceptual and creative space that I’m interested in, pitch is of low importance.

(this isn’t the case for straight mosaicking though, as you can definitely get some use out of pitch in that context, but that’s a separate discussion).

there are a few descriptors available for that, but my favourite dirty trick now is to use the pitch confidence of our pitch~ algo. It is dirty but works :wink: @groma might not think it is a good idea though, I did not ask him…

Yeah.

So in my context, the actual “pitch” of something may not be terribly important, but the fact that something is “pitchy” might carry more meaning. Even still though, I would be tempted to clump that under “timbre” territory rather than having an overall category of perception for it.

Actually remembered back to when I was experimenting with transducers a bit ago. I was using both confidence and flatness interchangeably and got good results with both. I don’t remember what I ended up using for that video, but this could also maybe get fused into a macro-descriptor where both are combined (and weighted) to get at how “pure” or “pitchy” something is.
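A minimal sketch of what such a fused macro-descriptor could look like. The function name `pitchiness`, the `weight` parameter, and the assumption that both inputs arrive as linear values between 0 and 1 are all made up for illustration (a flatness reported in dB would need converting first).

```python
def pitchiness(confidence, flatness, weight=0.5):
    """Blend pitch confidence with inverted spectral flatness into one 0..1 value."""
    # Assumption: both descriptors are linear values in the 0..1 range, so clamp them.
    c = min(max(confidence, 0.0), 1.0)
    f = min(max(flatness, 0.0), 1.0)
    # Flatness is high for noisy sounds, so invert it to point the same way as confidence.
    return weight * c + (1.0 - weight) * (1.0 - f)


# Made-up values: a ringing tom versus a snare buzz.
print(pitchiness(0.9, 0.1))
print(pitchiness(0.2, 0.8))
```

Setting `weight` towards 1 leans on confidence, towards 0 leans on flatness, which would make it easy to test which of the two is pulling its weight.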

I forgot to mention this in the original post, but there’s also a big distinction (for me) between online and offline versions of what this means. More specifically, “real-time” (JIT) vs pre-analyzed.

For real-time stuff, a lot of things aren’t available as analysis vectors at all (overall duration, meta-data/tags, etc…) and some things I don’t have a great idea of given the tiny analysis window (pitch, morphology). The latter might be mitigated some by the predictive thing, but that’s (literally) a separate discussion.

But there are some very useful things that can be present in a pre-analyzed file. All sorts of meta-data like names/labels/tags (which @tutschku makes good use of for his searching/querying) as well as some broader metadata like duration, amount of onsets, or “timeness”.

So all of that is to say that, with the benefit of perspective and context (and time), there are more relevant (both conceptually and perceptually) dimensions available for pre-analyzed files than what can be computed in realtime.

@rodrigo.constanzo I have been thinking about this also (if I’m understanding you correctly), and I have wondered about comparing the loudness (or amplitude) curve of a sample or analysis partition to various envelope shapes and getting a value describing how similar it is to each of those shapes, using shapes like a percussion envelope, sine window, ramped window, triangle, reverse percussion, no window at all (just flat loudness), etc.

I think basically you’d just have to do some normalizing of dimensions, then a squared difference of the two.

It would probably often end up being most similar to the ‘flat’ window, but when it’s not, maybe that’s interesting to use.
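Here is a rough sketch of that idea in plain Python, assuming the curve is linear amplitude (not dB) and that peak-normalizing is an acceptable way to "normalize the dimensions". The template shapes, their names, and the decay constant of 5 are all invented for illustration; time is normalized by sampling each template at as many points as the curve has frames.

```python
import math

# Hypothetical template shapes, each a function of position t in 0..1.
TEMPLATES = {
    "flat": lambda t: 1.0,
    "percussive": lambda t: math.exp(-5.0 * t),
    "reverse_percussive": lambda t: math.exp(-5.0 * (1.0 - t)),
    "ramp": lambda t: t,
    "triangle": lambda t: 1.0 - abs(2.0 * t - 1.0),
    "sine": lambda t: math.sin(math.pi * t),
}


def peak_normalize(curve):
    """Divide by the peak so every curve tops out at 1 (silence stays at 0)."""
    peak = max(curve)
    return [v / peak for v in curve] if peak > 0 else [0.0] * len(curve)


def envelope_distances(curve):
    """Mean squared difference between a curve and each template (0 = identical)."""
    n = len(curve)  # needs at least 2 frames
    shape = peak_normalize(curve)
    distances = {}
    for name, fn in TEMPLATES.items():
        template = peak_normalize([fn(i / (n - 1)) for i in range(n)])
        distances[name] = sum((a - b) ** 2 for a, b in zip(shape, template)) / n
    return distances


def closest_envelope(curve):
    """Name of the template with the smallest distance to the curve."""
    distances = envelope_distances(curve)
    return min(distances, key=distances.get)


decay = [1.0, 0.45, 0.2, 0.09, 0.04, 0.02, 0.01]  # made-up 7-frame amplitude curve
print(closest_envelope(decay))
```

The full dictionary from `envelope_distances` could itself be used as a small descriptor vector (one distance per shape) rather than only keeping the winner.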

Yeah something like that would be useful.

I guess that’s what I’ve been trying to get at with the linear regression stuff though that doesn’t really capture the complexity of a given envelope/morphology. It’s just tricky to have some idea of “time” in these kinds of analyses.

I don’t really know enough about vector/spline stuff, but I imagine there might be a better way to represent “shape” in a complex but low dimensional way.

What you’re suggesting could be a simple but effective workaround to that. Having a set of “known” envelopes that are available to match against. And particularly given my typical analysis size (256samps), it’s not like there’d be a massive amount of variation there.


Now that I’ve tested some PCA speed comparison stuff, I want to revisit this to see how this fares. I’m still testing to see what descriptors/stats are most salient in the first place, as well as being aware that NaNs shit the bed down stream, so I have to be aware of what descriptors end up in the general soup, and I’ve come up with something that may be a useful conceptual anchor.

My thinking about this before was trying to come up with meaningful overall descriptors for a sound given that 1) I don’t care about pitch so much and 2) I have a very small window (256 samples) to work with.

I think breaking things up in to an LTE (loudness/timbre/envelope) is still an overarching idea, but given some of the discussion on confidence above, I want to sprinkle in a bit of P(itch), primarily to differentiate whether or not something is “pitchy” or not, but this is not nearly as significant as other dimensions.

So at the moment I’m spitballing this:

  • Loudness (4D) - mean, std, min, max
  • Timbre (4D) - loudness-weighted 20(19)mfccs, mean, std, min, max → robust scale → 4D PCA (or MLP)
  • Pitch (2D) - confidence-weighted median, raw confidence
  • Envelope (4D) - deriv of loudness mean, deriv of loudness std, deriv of loudness-weighted centroid mean, deriv of loudness-weighted rolloff mean

So that would give me a 14D space that encompasses the aspects of sound I’m interested in. We’ll see how well that works, but I have a hunch (or hope) that having an E(nvelope) vector could be interesting. Plus it incorporates some additional perceptually-meaningful descriptors (centroid/rolloff).
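To make the L and E parts of that list concrete, here is a small sketch in plain Python over per-frame values. It is only one reading of the scheme: "deriv" is taken as a frame-to-frame difference, the loudness weighting uses linear amplitudes as weights, and all function names are invented. The zero-weight fallback is there because a silent window would otherwise divide by zero and send a NaN downstream.

```python
import statistics


def first_difference(frames):
    """Frame-to-frame change, one value shorter than the input."""
    return [b - a for a, b in zip(frames, frames[1:])]


def weighted_mean(values, weights):
    """Mean of values weighted by e.g. linear amplitude; plain mean if all weights are 0."""
    total = sum(weights)
    if total == 0:
        return statistics.fmean(values)
    return sum(v * w for v, w in zip(values, weights)) / total


def loudness_stats(loudness_db):
    """4D loudness summary: mean, std, min, max."""
    return [
        statistics.fmean(loudness_db),
        statistics.pstdev(loudness_db),
        min(loudness_db),
        max(loudness_db),
    ]


def envelope_stats(loudness_db, centroid, amplitudes):
    """Contour summary: mean/std of loudness change, amplitude-weighted mean of centroid change."""
    d_loud = first_difference(loudness_db)
    d_cent = first_difference(centroid)
    # Assumption: weight each difference by the amplitude of the later of its two frames.
    return [
        statistics.fmean(d_loud),
        statistics.pstdev(d_loud),
        weighted_mean(d_cent, amplitudes[1:]),
    ]


# Made-up values for a 7-frame analysis of a drum hit.
loud = [-12.0, -14.0, -18.0, -23.0, -29.0, -36.0, -44.0]
cent = [72.0, 70.0, 67.0, 65.0, 64.0, 63.0, 63.0]
amps = [0.25, 0.2, 0.13, 0.07, 0.035, 0.016, 0.006]
print(loudness_stats(loud) + envelope_stats(loud, cent, amps))
```

Because the contour values are differences rather than individual frames, the same functions work on 7 frames or 700, which is the "transfers regardless of duration" property I'm after.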

I did think about including individual analysis frames as part of the E(nvelope), as I only have 7 frames of analysis (at most) for my 256 sample window, but that wouldn’t scale up, and I think more generic contour descriptors like derivatives (or linear regression (or something even fancier)) may transfer from one “shape” to another, regardless of actual/fixed duration.

I still want to do that prediction thing where I query for the next 4410 samples which are analyzed in a similar way (or perhaps something slightly different), to then have two moments similar to @tremblap’s original approach. I would also do the same for the entire sample, though I wouldn’t be able to use that as an apples-to-apples realtime mapping.

The nitty-gritty of this will be a bit tedious unfortunately as it will involve a whole load of pruning steps along the way, particularly since all of these stats are non-adjacent, and as @tremblap unfortunately warns in the Example 11 thread:

If only there was another way…

That being said, I’ll start poking at this and post code/results when I get to the bottom of it.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////

A couple of questions to end this long necro-bump.

  1. Now that we have MLP, is that better (“generally speaking”) than PCA for large reductions (e.g. 76D MFCC space down to 4D T(imbre) space)
  2. In Example 11, there are normalization steps after most steps. If everything is going into fluid.robustscale~, is that strictly necessary?
  3. Now that we have fluid.robustscale~, is that better (“generally speaking”) for prepping data for PCA(/MLP) → fluid.kdtree~?
  4. Does the fact that fluid.robustscale~ is median-centered (as opposed to mean-centered) become problematic for PCA (mean-centered) or MLP (-1 to 1 activations)?

You know what I’m going to say :laughing: There just isn’t any generally speaking here: it depends on the data, and what you want to do with them.

  1. Using an MLP (presumably as an autoencoder) could, in principle, discover richer ways of reducing your space than PCA. But this hinges completely on the data it gets trained on, and some measure of luck that the features in the reduced space end up making sense to you (which seems to be your preoccupation at the moment). The alternative, of course, is to go supervised, and learn a regression to a labelled 4D space (but that depends on you doing some labelling). PCA still has a role to play, I think, in inspecting your data and seeing what you really need (checking that there’s some structure, throwing away correlated features, etc.).
    (My recent python experiment with the Principal Feature Analysis idea from you and Ted’s thread didn’t give me much faith that PCA is useful a priori for choosing one feature over another, but that its process of running KMeans on the bases matrix could be revealing for seeing how some of the features clump together, and so at least give an impression of some that could be discarded. I’ll work this up in Max/SC in due course: I don’t think it’s earth shattering, but could provide interesting insight).

  2. I think there was a reason @tremblap had a lot of independent normalizations applied separately to features in that example (perhaps to make recombining them afterwards easier?). But no, if you’ve decided on using a particular scaling approach across the board, you don’t necessarily need them.

  3. Again, it depends on the data, and there isn’t a generally speaking. The important thing for KD tree is that the ranges are comparable, so that each feature gets a fair shout in the distance measurement. Which normalizing process you opt for depends, in part, on how outlier-y the data is and what you want the consequences of that to be. If you have lots of outliers, then normalize / standardize will tend to compress the range that the bulk of your data occupies, in order to make room for the bandits (but these will, at least, have vaguely predictable bounds).

  4. (a) No, because PCA is going to mean-centre the data anyway: if the distribution is massively non-gaussian, then its results will have violated some of its core assumptions, but probably not catastrophically (b) For the MLP, no (but not all activation functions are -1:1)
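A small plain-Python illustration of points 3 and 4(a). It assumes that robust scaling means "subtract the median, divide by the interquartile range", which is my understanding of the default behaviour of fluid.robustscale~; the exact quantile method used here is an assumption, and the data is invented.

```python
import statistics


def minmax_scale(values):
    """Normalize to 0..1 using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]  # one bandit at the end

normalized = minmax_scale(data)
robust = robust_scale(data)

# Range occupied by the nine "normal" points under each scheme.
bulk_normalized = max(normalized[:9]) - min(normalized[:9])
bulk_robust = max(robust[:9]) - min(robust[:9])
print(bulk_normalized, bulk_robust)

# Median-centred data is not mean-centred, but PCA subtracts the mean itself.
print(statistics.fmean(robust))
```

With normalization the bulk of the data is squeezed into under a tenth of the range to make room for the outlier, while robust scaling keeps the bulk spread out and lets the outlier land far away. The mean of the robust-scaled data is not zero, which is harmless for PCA because it removes the mean anyway.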


Not long after writing this I remembered how faffy setting up the autoencoder was, and how, as of yet, I’ve not gotten anything to converge at all. So the prospect of trial-and-error training an autoencoder (potentially for hours/nights) for each corpus(?!) doesn’t sound very exciting. So for now I’ll go with PCA and see how I get on.

Unless I’m mistaken, these two (PCA/MLP AE) are the only options for transformpoint-ing things in realtime?

Ok, that makes sense. I think for my purposes I’m going with the robustscaling for now, to have a focus on the “middle chunk” of things, with outliers going off however they’d like.

By this I meant, does MLP break if you set @activation 2 and give me numbers that are >1 or <-1, does that fuck up the convergence. Which was similar to my concern for fluid.kdtree~ where I didn’t know if the tree got busted if you had wonky/misshapen data (in as much as it would compute distances at all, not necessarily with regards to useful distance matching).

////////////////////////////////////////////////////////////////////////////////////////////////

I guess the reason I want to come up with some somewhat robust/generalizable analysis/processes is that the faff of setting each one up is so significant that I wouldn’t want to have a bespoke corpus (and corresponding realtime) analysis scheme for each corpus based on “what the data looks like” in that specific corpus. Even if it’s not perfect, having something that I can use to analyze arbitrary corpora and feed it arbitrary input would go a long way to being able to use this part of the toolbox.

I’m nearing my third (maybe 4th) hour of trying to build the (seemingly) straightforward analysis scheme I outlined above. I think I just got to the bottom of the T(imbre) step, but only in terms of the buffer~-based processes. It’s not yet gone to a dataset → robustscaling → PCA… (I realized I need to craft the whole scheme first, then fill up a dataset, figure out the fits for the robustscaling/PCA, to then go back and apply transformpoint to the realtime equivalent).

It oughtn’t be that tricky to get something to converge (useful results is different). Always (always) start small, with low learning rates, and embrace tweaking the LR (to try and find a sweetish spot between not-converging and converging-but-too-slowly) as part of the model design. Be wary of making the network bigger, unless you’re sure you have enough training data to absorb the extra complexity (and can live with the extra computation).
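To show what that sweet spot looks like in the simplest possible case, here is a toy sketch: a one-weight "network" fitted by plain gradient descent, swept over a few learning rates. It is not an MLP and has nothing to do with the FluCoMa objects; the data, the candidate rates, and the step count are all made up, and the rates that work here say nothing about what will work for a real network.

```python
def final_loss(learning_rate, steps=200):
    """Fit y = w * x by plain gradient descent and return the final mean squared error."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0 * x for x in xs]  # toy data: the true weight is 3
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n


def sweep(learning_rates):
    """Train once per learning rate and report the one with the lowest final loss."""
    losses = {lr: final_loss(lr) for lr in learning_rates}
    best = min(losses, key=losses.get)
    return best, losses


best, losses = sweep([0.00001, 0.001, 0.05, 0.2])
for lr, loss in losses.items():
    print(lr, loss)
print("best:", best)
```

In this toy the smallest rate has barely moved after 200 steps, the largest one blows up, and the ones in between converge at different speeds. The same loop wrapped around a real training run is a crude form of automated parameter search.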

I think UMAP’s had it since Alpha 07.

No, it shouldn’t screw up convergence in and of itself: if the whole range of your feature is huge, then that might give odd results, but in general points outside -1:1 are fine (consider that the network is learning weights to apply to the inputs anyway, so these weights can just get smaller, within reason).

It doesn’t screw up the KD tree either. The tree doesn’t really care about the range of data in absolute terms, but the euclidean distance it uses implicitly assumes that features are comparably scaled (euclidean distance = sqrt of sum of squared distances of each feature for the two points being compared).
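A quick plain-Python illustration of why comparable scaling matters for the distance. The two features (a centroid in Hz and a 0..1 confidence), the corpus points, and the divisor used for rescaling are all invented for the example; a brute-force search stands in for the tree, since both return the same nearest point.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, points):
    """Index of the point closest to the query (brute force, same answer as a KD tree)."""
    return min(range(len(points)), key=lambda i: euclidean(query, points[i]))


def rescale(point, ranges):
    """Divide each feature by an assumed range so that features become comparable."""
    return [v / r for v, r in zip(point, ranges)]


# Features: [centroid in Hz, pitch confidence 0..1]
corpus = [[1050.0, 0.1], [1200.0, 0.9]]
query = [1000.0, 0.9]

raw_match = nearest(query, corpus)

ranges = [10000.0, 1.0]  # hypothetical feature ranges
scaled_match = nearest(rescale(query, ranges), [rescale(p, ranges) for p in corpus])

print(raw_match, scaled_match)
```

On the raw numbers the centroid swamps everything, so the match is the point that is 50 Hz away even though its confidence is completely different. Once both features sit in similar ranges, the point with matching confidence wins instead.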

I think some of this will partly be down to it being a new way of working (and there still being unsanded UX edges to the toolkit, natch): to a very large extent ‘programming’ with ML stuff is about data monkeying as much / more than it’s about models. I’ve certainly seen the argument put forward that it constitutes a completely different programming paradigm.


I need to give this a good college try I think as, at best, I’ve left it running for hours with some generic-ish settings and that went nowhere. Obviously not an ideal way to go, though I did use a slightly modified version of @tutschku’s looping patch from the last plenary as a starting off point. I just haven’t gotten my head around all the parameters and how to massage them in a way that works.

(Much like the dance of “pick the correct descriptors with your fleshy human brain so I can tell you you chose wrongly”, it feels like picking parameters for this stuff is pressing a button that says “change random numbers” and then the computer buzzes you and says “wrong, try again”, over and over until it, somehow, magically works. (Surely unsupervised parameter selection for supervised machine learning is a thing?!)).

Ah right. Forgot about that (one’s ability to transform point). I’ll compare and update the speed comparison thread accordingly.

All good to know. I remember running into some issues with that (see point 1 about lack of convergence) when we first got these, as I had no idea what the ramifications of the @activation param were (à la fluid.mds~'s zesty @distancemetric).

I can totally see that. I don’t want to derail a very useful conversation here, but I’d say 80% of my faff/friction at the moment has to do with getting the specific numbers I want, in the places that I want them. So I’m hardly even at the point of legitimate confusion, even though I’m quite obviously confused a lot (this thread included).

But indeed, the paradigm of moving around and transforming huge fields of numbers, with every tiny thing mattering in a way that is (often) unintelligible by humans, is pretty hard to decipher. Particularly with how “it depends” things can be.

Actually, is this a thing?

For example, a NN could expose the individual node weights and multipliers, but it would be ridiculous to try and set those manually. So is it possible to just zoom out the gradient descent thing so that the initial parameters for the algorithm itself are randomized and iterated until convergence happens? I suppose that can be a very slow process if it picks some horrible starting params, but with short enough iterations and changes, it could presumably evolve to something more useful all on its own.

If it is, I would love that…

Just have an object called fluid.stuff~, and you send it a message makeitwork, and 2 days later you come back to a converged network, and somehow 0.3333 of a Bitcoin too…

It’s kind of a thing, yes. See, e.g. 3.2. Tuning the hyper-parameters of an estimator — scikit-learn 0.24.1 documentation

But, still, fleshy brain remains responsible for gathering the good quality data to feed such a beasty. Often, the time and computation investment to do this sort of thing is overkill, and a disciplined manual approach can get you somewhere useful quickly enough, once the various moving parts make more sense. In this post, and the one after, I pointed Alex towards some meatier guidelines and explanations. Bottom line though is to start with hidden set to the smallest network you can get away with (i.e. for an autoencoder, one layer the size of the reduced space you desire), and a minuscule learning rate (like 0.00001 minuscule), and don’t worry overly about the other parameters until later.


The idea was to keep a sort of perceptually equivalent scale, so 1dB is 1semitone of pitch is 1 semitone of centroid. Simple to understand the relations in a musician space, but timbre is super limited there… trying to shrink an MFCC space to 1D of 100points (the range in dB and semitone of useful loud and pitch) is still on the table, via kmeans first. This will happen when I get the headspace which is soon I hope

Oh yeah, I remember that discussion. I don’t think it happened on the forum at all, so I’ll make a thread for it now.

With regards to the scaling, at the end of each of my processing chains (except Timbre) I’m ending up in “natural” units (e.g. linear amplitude or MIDI cents), which I can presumably scale one to the size of the other, and those would be alright. I can obviously properly normalize (or robust scale) things afterwards if I want the entire range being used, but I guess that’s more an aesthetic choice.

I’ve been thinking about this again over the last few days, in light of some of the info from @weefuzzy in this thread and some of the comments from @tremblap during the Thursday geek out sessions.

I’m thinking of abandoning the E(nvelope) part altogether, since with the short time frame it isn’t massively descriptive. That being said, some of the clustering from it was alright, since it relied heavily on a mixed collection of means of derivatives. So those may be useful to keep, but perhaps moving them over to their hierarchical descriptor types.

What I’m also thinking about now is incorporating more vanilla spectral descriptors alongside the MFCCs, as well as lower order MFCCs, to create a more comprehensive T(imbre) space. I’ve done a tiny bit of testing with this, but manually assembling variations of descriptors/stats takes me a long time, so it’s a bit discouraging to code for an hour and see bad results, then code again for an hour and see bad results, etc…

I’m also rethinking trying to “balance” the amount of descriptors per archetype. So Timbre is potentially over represented with the amount of spectral moments and MFCCs available, so reducing that down is definitely worthwhile, or eventually doing some of that k-means clustering-as-descriptor thing that @tremblap has talked about. But Loudness, and much more with Pitch, doesn’t really have that many dimensions that make sense. With my short time frames, I could potentially forgo summary stats for Loudness and just take each frame, potentially alongside std/min/max and derivatives, so the loudness is as comprehensively represented as timbre.

For pitch, however, there’s only really one value that matters…pitch. Confidence is useful for forking or conditional matching (separate conversation), but as a raw descriptor, it’s perhaps better suited to describe timbre.

So unless loudness and timbre can get boiled down to a single number, and even then, it seems like a lot of information and detail is getting thrown out, it will be hard to have each aspect equally represented.

For 80% of my purposes pitch will largely be irrelevant since I don’t have too many pitched elements in the input sounds I’m using. There sometimes are, and when they are, I would like them considered, but that can be handled in a different way (biasing etc…).

Towards that final point, is it a viable thing to distort the space such that you have (as an example) 10d of loudness stuff, 10d of timbre stuff, and 1d of pitch, but the pitch descriptor is scaled up 10x such that it impacts the overall distance more. Does that just skew everything around in a different way than if you had 10d of pitch information?

IIRC @balintlaczko had done that in his patch. If not him, it might be @tedmoore. If neither, I’ll look further, I have notes somewhere of someone doing an even larger timbral space…

yes it is different. think in 2D. 2 x 1 cm is 2cm if they go in the same dim, but if they are orthogonal it is the diagonal so √2 - you can try it dirty now to test if that works 2 ways, and I’m devising examples for scaling with fluid.normalize and json you will love and hate.
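The arithmetic can be checked in a few lines of plain Python. This assumes the idealized case where "10d of pitch" means ten dimensions that all differ by the same amount, which real pitch-related descriptors would not do exactly.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d = 1.0  # difference in pitch between two sounds, in whatever unit

# The 2D case: two 1 cm steps in the same dimension versus in orthogonal dimensions.
same_dim = euclidean([0.0], [2.0 * d])
orthogonal = euclidean([0.0, 0.0], [d, d])

# Ten dimensions each differing by d, versus one dimension scaled up.
ten_dims = euclidean([0.0] * 10, [d] * 10)
scaled_by_10 = euclidean([0.0], [10.0 * d])
scaled_by_root_10 = euclidean([0.0], [math.sqrt(10.0) * d])

print(same_dim, orthogonal)
print(ten_dims, scaled_by_10, scaled_by_root_10)
```

Ten orthogonal dimensions that each differ by d contribute √10 × d (about 3.16 d) to the distance, whereas a single dimension scaled 10x contributes 10 d. So under this idealization, scaling the one pitch dimension by √10 rather than 10 is what matches the pull of ten pitch dimensions.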

Yeah I’ve done this. Lately, I’ve been using FluidSpectralShape + FluidPitch + FluidMFCC (often between 9 and 13 MFCCs). This all usually ends up in an MLP, maybe even through PCA first, so it all kind of gets put in the wash anyway.
