Example 11 - verbose

Ok, this one is a doozy, and it took me a bit to figure out what is happening (with a happy crash along the way). It looks quite exciting, though sadly it is not easy to test with your own sounds since specific durations are pre-baked into the 2nd subpatch (and consequently the 3rd subpatch), so I’ll play around more with the “examples soup”.

So in terms of the final subpatch here, you’re essentially building a 12D LPT-kind of thing, with 4D per macro-feature (“loudness”, “pitch”, “timbre”). I like where this is going!

I’m having a little trouble following the dimensionality reduction stuff, and which is being used where.

So in the 1st major subpatch, the MFCCs are standardized before PCA (I thought PCA liked normalized data? Either way…). And then in the 3rd subpatch, everything is normalized (including the output of the MFCC->standardize->PCA). Is that correct? It’s a little tricky to follow with all the recursive object usage.

Lastly, the “weighting” bit at the end, which I believe has copy/pasted ‘clue’ text from elsewhere. You’re just rescaling a specific feature inside the normalized space? In this case, making “pitch” go from -4.5 to 4.5, vs everything else being 0. to 1.?

Very much a work in progress (I finished translating PA’s SC version last night), so I’d hope that both the intent and patching get clearer as we work stuff out.

Standardising is canon here, but normalising is better than nothing. What PCA really likes is for the data to have 0 mean, and it doesn’t hurt if their ranges / variances are in the same ballpark.
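
For anyone who wants to poke at that difference outside of Max, here is a minimal Python/scikit-learn sketch (stand-ins only, nothing to do with the fluid.* objects): standardising gives every column zero mean and unit variance, normalising just squashes each column to 0–1.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

# 200 fake slices x 24 MFCC statistics, with very different ranges per column
X = np.random.randn(200, 24) * np.linspace(1, 50, 24)

standardised = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
normalised = MinMaxScaler().fit_transform(X)      # each column rescaled to 0..1

# PCA mainly wants centred data; standardising also puts the variances in the same ballpark
timbre_4d = PCA(n_components=4).fit_transform(standardised)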

Yes, although I don’t quite follow “recursive” here. Because I wasn’t quite paying big-picture attention whilst doing the Max version, I hadn’t noticed that we’d need two of the objects from the first subpatcher later (pca and standardize), hence the sends there. The rest of the sending shenanigans is to cut down on patch cords.

I think @tremblap’s intuition with the normalising was that rescaling the pitch upwards should de-prioritise it in the KD tree look up (because the relative distances would be greater). As with Example 10, probably best to view this as a hypothesis in progress…
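
A throwaway 2D way to test that hypothesis, with made-up numbers and scipy’s cKDTree standing in for fluid.kdtree~ — run it and see which way the rescaling actually pushes the match:

import numpy as np
from scipy.spatial import cKDTree

# two candidate slices: one matches the target's pitch well, the other its timbre
corpus = np.array([[0.95, 0.30],   # pitch close, timbre far
                   [0.40, 0.95]])  # pitch far, timbre close
target = np.array([1.0, 1.0])

for pitch_scale in (1.0, 4.5):
    scaled_corpus = corpus * [pitch_scale, 1.0]   # stretch the pitch axis only
    scaled_target = target * [pitch_scale, 1.0]
    dist, idx = cKDTree(scaled_corpus).query(scaled_target)
    print(f"pitch x{pitch_scale}: nearest is slice {idx} at distance {dist:.2f}")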

Ok, good to know.

So that basically means we’re looking at:
[audio]->[descriptor(s)]->[weighting]->[stats]->[flatten]->[prune]->[standardize]->[pca]->[(merge with others)]->[normalize]->[query/whatever]
?

It was just that. Thankfully all the clues helped, but it was tricky to follow since things jumped around and were reused in places. Plus all the ;notation too!

Is this in response to the weighing (weighting? (actually, are we using weight/weighing to mean two separate things now? (e.g. loudness-weighted descriptors vs weighing parameters in a query))) question or to the normalization stuff, or just a general comment?

In response to:

But

Are they two different things?

I guess I meant whether there were different terms for those use cases (scaling frames to produce a more perceptually-meaningful summary vs biasing a query towards specific feature(s)).

Yes, but with a few major differences: there is no time division, which was the strength of LPT. I think it is time for me to be candid about this patch in writing. I’ll do that just below, but first I’ll answer the specifics:

Yes. The process you describe is a little short of the truth, so here goes in pseudo-code:

//analysis
for each slice:
  - take the pitch, process like in example 10b (weighted by stringent confidence, thresholded, resulting in a very sparse dataset of valid entries, but we know they are valid). Put 4 dims in PitchDS as is.
  - take the loudness, put that in LoudDS, 4 dims, as is
  - take the MFCCs, weight coeffs 1 to 12 (scrap 0) by the loudness from a high ceiling of -70 LU, put the same 4 stats as above on these 12 coeffs in an MFCC-DS, which you then standardise and PCA to 4 dims into TimbreDS

//assembly of the weighted DS and its query
for each slice:
  - normalise the 3 DS - this is to scale their relative Euclidean distance as pointed to by Daniele.
  - put in a tree

//for querying
- analyse the target like each item above (including std and pca for the target mfcc)
- normalise each of LPT according to the coeffs in the assembly
- query the tree
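
For anyone who reads code more easily than patch cords, here is a rough Python transliteration of that pseudo-code, with numpy/scikit-learn/scipy as stand-ins for the fluid.* objects. The data is fake, the stats are simplified to mean/stddev, and the confidence thresholding is skipped, so treat it as a sketch of the shape rather than the patch:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# fake per-slice, per-frame analyses standing in for bufpitch~ / bufloudness~ / bufmfcc~
n_slices = 100
pitch = [rng.random((50, 2)) for _ in range(n_slices)]            # fund + confidence
loud = [rng.random((50, 2)) * 60 - 70 for _ in range(n_slices)]   # loudness + peak, dB-ish
mfcc = [rng.standard_normal((50, 13)) for _ in range(n_slices)]   # 13 coefficients

def four_stats(frames):
    # mean and stddev of both channels -> 4 dims per slice
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def weighted_mfcc_stats(mfcc_frames, loud_frames, floor=-70.0):
    # drop coefficient 0, weight the remaining 12 by loudness above -70 LU
    w = np.clip(loud_frames[:, 0] - floor, 0, None)
    w /= w.sum() + 1e-9
    coeffs = mfcc_frames[:, 1:]
    mean = (coeffs * w[:, None]).sum(axis=0)
    std = np.sqrt((((coeffs - mean) ** 2) * w[:, None]).sum(axis=0))
    return np.concatenate([mean, std])   # simplified to 2 stats x 12 coeffs

# analysis: one row per slice in each dataset
pitch_ds = np.vstack([four_stats(p) for p in pitch])
loud_ds = np.vstack([four_stats(l) for l in loud])
mfcc_ds = np.vstack([weighted_mfcc_stats(m, l) for m, l in zip(mfcc, loud)])

# timbre: standardise the MFCC stats, then PCA down to 4 dims
std_model = StandardScaler().fit(mfcc_ds)
pca_model = PCA(n_components=4).fit(std_model.transform(mfcc_ds))
timbre_ds = pca_model.transform(std_model.transform(mfcc_ds))

# assembly: normalise each dataset so their Euclidean contributions are comparable, then one tree
parts = (loud_ds, pitch_ds, timbre_ds)
norms = [MinMaxScaler().fit(ds) for ds in parts]
corpus = np.hstack([n.transform(ds) for n, ds in zip(norms, parts)])
tree = cKDTree(corpus)

# query: analyse the target the same way, reusing the fitted standardise / pca / normalise models
t_pitch, t_loud, t_mfcc = rng.random((50, 2)), rng.random((50, 2)) * 60 - 70, rng.standard_normal((50, 13))
t_timbre = pca_model.transform(std_model.transform(weighted_mfcc_stats(t_mfcc, t_loud)[None, :]))
query = np.hstack([norms[0].transform(four_stats(t_loud)[None, :]),
                   norms[1].transform(four_stats(t_pitch)[None, :]),
                   norms[2].transform(t_timbre)])
dist, idx = tree.query(query)
print("nearest slice:", int(idx[0]))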

Now, on this: there are so many variables one can play with, it is crazy. So this is research in progress, trying to find an MFCC space that is (personally, perceptually) as “reactive” as a valid pitch and loudness one. But as I do the tests, I discover a lot of assumptions in my thinking on what is an accurate match. It is incredibly anchored in a fleeting musicking need more than in anything objectively nearer. So I keep ploughing on with my research, now that I (we) have the tools to do it accurately, within Max and SC.

Things I am going to try next in this example 11:

  • scale on 0-100 instead of 0-1 as a baseline, or maybe -50 to +50. It feels easier to grasp.
  • make a graphic example of the assumption that scaling the distance = lowering the impact, in 2D, to see if that works as well as it does in my head

I am writing a fixed media piece with this, and will try it in action, and in comparison with my MFCC-musaiking of my crazy synth and my analog synth (both examples I provided people with) and the APT…

On the next horizon, in months:

  • a sort of branching version of it all
  • a sort of MLP-based mapping between a bass analysis and a corpus space, including timbral space.
  • removing all pitched components and using only the noise in the corpus, while passing on the pitch of the target (a sort of cheap corpus-based vocoder)

We’ll see where it all goes. I also look forward to seeing what people will do with the current tools - there are a lot of possibilities!

I guess if you want equal weighting, you can just use 36d instead of 12d, with the same time series involved. Or maybe something more AudioGuide-esque and have equally weighted macro-frames to capture the morphology more clearly.

I was wondering about this, and I guess this is something you tried to explain in the last geek out. So with this approach, you may still query for pitch, but everything that was dismissed/deleted will create a malformed query now?

I guess this is a workaround to having pitch “default” to being low, and centroid “default” to being Nyquist or whatever it was you did.

I like the idea, but then wonder about how to deal with that in terms of interface (spamming errors, any errors?, etc…).

Out of curiosity, how much of the testing has been done with arbitrary-length segments? As @tutschku (and I) have mentioned, I find it next to impossible to gauge “timbre” between a tiny sample of audio (100ms) vs a 3-second+ fragment of “music”. A lot of the examples segment into these big (and small) chunks, and I can’t really tell since I don’t “hear in means”.

Is this just conceptual or is there some technical benefit for (presumably) keeping dB and MIDI in their “natural” units?

Yes please!
Some of the patches are easy to swap out and change bits in, but most of them presume fixed dimensions and features, making it very difficult to test with different amounts of features/stats (since all the query/flatten stuff has pre-baked indices).

Interesting. Do you mean using something like what @spluta has been doing by navigating a corpus space (ala sandbox3) with “live” bass audio?

By “passing on the pitch”, do you mean applying the pitch to the target? So you find the nearest match for loudness/timbre in the normal ways, then literally map the pitch of the source to the matched grain/segment?

Don’t forget: indiscriminate dimensions are worse than no information. So I think my new approach will be to care about the loudness, pitch and timbral time series independently. I really don’t care about the pitch of the attack, for instance.

I think the workaround is more along the lines of dismissing the info. I’m thinking of branching with pitch confidence. It is not clear yet in my head…

My tests are done with the example I provide, but mostly with my own synth test dataset, which is much more in similar-length chunks… but all that varies with context, so just poke at it and have fun!

My current irrational intuition is the latter… but hey, I need to try it.

I mean what I do in SB3, but mapping to a curated set of parameters - so I would curate the mapping and try to MLP the relations I have made, semi-supervised learning style.
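
If it helps to picture it, a hypothetical sketch of the gist with scikit-learn’s MLPRegressor (made-up shapes and data, not the actual SB3 patch): fit a small network on the hand-curated (analysis -> parameters) pairs, then let it interpolate for new bass analyses.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.random((40, 12))   # 40 curated bass-analysis vectors (e.g. a 12d L/P/T space)
y = rng.random((40, 5))    # the 5 synth parameters chosen for each of them

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=1).fit(X, y)
params = mlp.predict(rng.random((1, 12)))   # new analysis frame -> predicted parameters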

No, I literally mean to pass the pitch material from one and apply the volume and noise/timbre. Again, sharing ill-formed creative coding ideas. This will materialise or not in the example folder.

let’s see…

With this I meant 12d per time frame. So everything exactly as you’re doing it, but for the first 50ms, then 150ms, etc… So 36d being 3x12d (rather than an arbitrary “36d”).
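
In case it helps, a hypothetical numpy sketch of that idea (analyse_12d is a placeholder for whatever currently produces the 12d vector for one slice):

import numpy as np

def analyse_36d(frames, analyse_12d, boundaries=(0.25, 0.5)):
    # split one slice's frames into 3 consecutive spans (e.g. attack / early / rest)
    # and run the same 12d analysis on each, giving 3 x 12 = 36 dims per slice
    n = len(frames)
    cuts = [0] + [int(b * n) for b in boundaries] + [n]
    return np.concatenate([analyse_12d(frames[a:b]) for a, b in zip(cuts[:-1], cuts[1:])])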

Great, I love it. Very much looking forward to seeing how you tackle this as I’m in a similar boat (with the complications of super short analysis windows, and generally longer samples).

I’ve got a question about the 0th coefficient in example 11.

I don’t see where in the patch the 0th coefficient is getting ditched.

In p weighted featureExtractor we have this:
Screenshot 2020-10-03 at 10.59.40 pm

followed by this:
Screenshot 2020-10-03 at 10.59.48 pm

Which then produces a 168d dataset (instead of a 182d one).
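
(Presumably that difference is just the dropped channel: 7 stats × 2 (values plus one derivative) per channel gives 13 × 14 = 182, vs 12 × 14 = 168.)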

Is it the @startchan 1 in the fluid.bufstats~ that ignores the 0th coefficient?

Is it the @startchan 1 in the fluid.bufstats~ that ignores the 0th coefficient?

Yes. Maybe I should have annotated it ;-D

I was confused as shit, and had started this post, when I noticed that. There’s just a bunch of @numchans kinds of stuff peppered throughout, so it looked like any other of these.

There’s a single @blocking 0 in one of the fluid.bufstats~ in the processing chain inside p weighted featureExtractor.

Is that there for any meaningful reason?

If I run my batch analysis stuff with this (it’s more individual file/polybuffer-based, rather than make-a-big-buffer-and-segment), I get loads of this in my console:

Batch Training: 3
Batch Training: 4
fluid.bufstats~: already processing
Batch Training: 5
Batch Training: 6
fluid.bufstats~: already processing
Batch Training: 7
Batch Training: 8

If I remove the @blocking 0 it appears to work fine, but I didn’t know if there’s a specific technical reason for it being there that I’m overlooking.

It shouldn’t be there in the example, no (because there’s no advantage to lots of threading when you have lots of wee slices of a big buffer, partly down to an inefficiency I hope to solve one day).

If you’re seeing that message, then this is because it’s retriggering before the asynchronous process is done, which I’d guess is because you’ve got some sort of [t something something] above the processor, waiting on it. This configuration won’t work when there’s non-blocking stuff (you’d need to wait for the bang out of fluid.processsegments instead).

Cool, I’ll remove it. I just didn’t want to later find out that for some specific funky edge case it does something that isn’t immediately obvious.

It’s weird that I’m getting that message there, as this is in the context of a defer'd loop. I guess because it spawns its own thread (in @blocking 0), that puts it outside of the remit of a t b b b to manage order of operations.

As a side question for this, is there any (speed/cpu) difference between having all the processes in a serial chain (like ex11) vs having them in a “parallel” t b b b context? Obviously the loudness needs to happen before the MFCCs for the weighting, but I’m never sure in circumstances like this what is better to do. (I’ve done parallel for all my patches so far, so the serial processing here stands out to me.)

(sorry to keep bumping and asking questions on this one, it’s a bit tricky to follow in places)

Following on with this bit. The “timbre” 4d space is made up of a PCA’d version of stuff (which happens inside p explore mfcc matching, and which I can follow).

For pitch and loudness, it looks like you’re taking the first four columns of each dataset which are mean of pitch, mean of confidence, std of pitch, std of confidence and mean of loudness, mean of peak(?), std of loudness, std of peak(?).

So for pitch and loudness there are no derivatives being taken into account (even though they are analyzed in the initial step). It’s literally just the mean and std of pitch/confidence and loudness/peak, respectively?

///////////////////////////////////////////////////////////////////////////////////////////////////////////////

Following on from that, was 4d-per-“descriptor” arrived at just from testing or poking? It seems like for pitch/loudness it may be useful to have twice as many descriptors (the same four (if I understood correctly), along with their derivatives), and I would guess that PCA would be just as happy to make an 8d reduction rather than a 4d one.

In SC I’m taking the first 4 columns, which, because we interleave channels, are listed in alternation in the flattened buffer (fund, conf, fund, conf, etc.) for each stat (mean, stddev, …), so taking the first 4 gives:

mean-fund, mean-conf, stddev-fund, stddev-conf
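
In numpy terms, a toy picture of that layout (assuming just 7 stats and no derivatives here; not the actual SC/flatten code):

import numpy as np

# bufstats-style result for pitch: 7 stats (rows) x 2 channels (fund, conf)
stats = np.arange(14).reshape(7, 2)   # row 0 = means, row 1 = stddevs, ...
flat = stats.flatten()                # channels alternate within each stat
first_four = flat[:4]                 # mean-fund, mean-conf, stddev-fund, stddev-conf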

This is sharing research in progress to try to get people to try stuff, so if you keep on trying stuff, it does what it is meant to :slight_smile:

To give you an idea: at the moment, I am trying to make sense of an MFCC-to-crazysynth mapping, to allow controlling the synth via MFCC. I did that during the plenary, sent the patch, it works… but not to my liking on my current source, so another mapping is needed… I’m exploring as I compose with the tools, and with @weefuzzy and @groma also in a similar mode now, I hope we will pollute this place with more half-baked ideas. A bit of divergence in the soup.