Hybrid/Layered resynthesis (3-way)

Ok, going to code something up with these numbers: 64/256/768, starting 512 samples late, and analyzing full windows for each bigger size.

I’ll see if I can get some querying and brutalist playback based on like-to-like, and then experiment with the temporal mismatch stuff.

Will post patches/results.

edit:
Sticking with the descriptors and such that I know (will save MFCCs for another time, as I already have enough dimensions going here).

Also leaning towards over-analyzing in each time window, seeing what works, and then revising from there.

But this is what I’m aiming to do (with a quick sketch of the entry layout after the list):

Entries to store in the database:

index - index
name - name
duration - duration
time_centroid - time centroid
onsets - onsets

64_loudness_mean
64_loudness_max
64_centroid_mean
64_flatness_mean
64_rolloff_max (max90)

256_loudness_mean
256_loudness_max
256_pitch_median
256_pitch_confidence
256_pitch_optimized
256_centroid_mean
256_flatness_mean
256_rolloff_max

256_loudness_derivative (mean of deriv)
256_loudness_deviation (std dev)
256_centroid_derivative
256_centroid_deviation
256_rolloff_derivative
256_rolloff_deviation

768_loudness_mean
768_loudness_max
768_pitch_median
768_pitch_confidence
768_pitch_optimized
768_centroid_mean
768_flatness_mean
768_rolloff_max

768_loudness_derivative (mean of deriv)
768_loudness_deviation (std dev)
768_centroid_derivative
768_centroid_deviation
768_rolloff_derivative
768_rolloff_deviation

loudness_mean
loudness_max
pitch_median
pitch_confidence
pitch_optimized
centroid_mean
flatness_mean
rolloff_max

loudness_derivative (mean of deriv)
loudness_deviation (std dev)
centroid_derivative
centroid_deviation
rolloff_derivative
rolloff_deviation
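
To make that layout concrete, here’s a rough sketch (in Python, not the actual Max/entrymatcher database) of what one flat entry would hold. The field names come from the list above; the dict shape and the make_entry helper are just for illustration.

```python
# Field names mirror the list above; the flat-dict structure and this helper
# are illustrative only, not the actual entrymatcher setup.
BASE_STATS_64 = ["loudness_mean", "loudness_max", "centroid_mean",
                 "flatness_mean", "rolloff_max"]

FULL_STATS = BASE_STATS_64 + [
    "pitch_median", "pitch_confidence", "pitch_optimized",
    "loudness_derivative", "loudness_deviation",
    "centroid_derivative", "centroid_deviation",
    "rolloff_derivative", "rolloff_deviation",
]

def make_entry(index, name, duration, time_centroid, onsets, analysis):
    """Build one flat record; `analysis` maps 'prefix_stat' keys to values."""
    entry = {"index": index, "name": name, "duration": duration,
             "time_centroid": time_centroid, "onsets": onsets}
    # The 64-sample window only stores the cheap stats...
    for stat in BASE_STATS_64:
        entry[f"64_{stat}"] = analysis.get(f"64_{stat}")
    # ...while 256, 768, and the whole file get the full set.
    for prefix in ("256", "768", ""):
        for stat in FULL_STATS:
            key = f"{prefix}_{stat}" if prefix else stat
            entry[key] = analysis.get(key)
    return entry
```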

Ok, so I’ve been plugging away at this this week and have the offline analysis sorted, and today I finished the real-time version.

Other than the faff of creating all the buffers and double-checking all my @source and @features destinations, the offline part was fine enough.

The real-time version was trickier. In order to make things fast, and (hopefully) meaningful, I increase the number of descriptors/stats I’m using with each step.

So at the moment the idea is that playback will begin after 512 samples have passed (from the onset).

The first analysis window is 0-64 and analyzes for the mean of loudness, mean of centroid, and max (90%) of rolloff.

Concurrently, another analysis window of 0-256 runs, analyzing the mean of loudness and of its first derivative, the median of pitch, and the mean and first derivative of centroid, flatness, and rolloff.

I then wait 256 more samples and analyze samples 0-768 for all of the above, plus the standard deviation of everything but pitch.
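
As a rough sketch of that staging (Python pseudologic rather than the actual Max patch; run_descriptors is a hypothetical stand-in for the fluid.* analysis objects), the idea is basically:

```python
# Only the window sizes and the per-stage descriptor sets come from the text
# above; everything else here is a placeholder.
STAGE_A = ["loudness_mean", "centroid_mean", "rolloff_max90"]
STAGE_B = ["loudness_mean", "loudness_deriv_mean", "pitch_median",
           "centroid_mean", "centroid_deriv_mean",
           "flatness_mean", "flatness_deriv_mean",
           "rolloff_mean", "rolloff_deriv_mean"]
# Stage C adds the standard deviation of everything but pitch.
STAGE_C = STAGE_B + ["loudness_std", "centroid_std", "flatness_std", "rolloff_std"]

STAGES = [(64, STAGE_A), (256, STAGE_B), (768, STAGE_C)]

def analyze_stages(input_buffer, run_descriptors):
    """Run each stage over samples 0..window of the incoming onset."""
    return [run_descriptors(input_buffer[:window], stats)
            for window, stats in STAGES]
```

Playback still starts once 512 samples have passed, with the stage C results landing roughly 256 samples after that.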

I wrestled with this a bunch since I wanted to keep each step fast (in terms of analysis time). At the moment, on my laptop (which is a fair bit faster than my desktop), the 0-64 window takes around 0.14ms on average, 0-256 takes a jump up to 0.67ms, and 0-768 takes a big jump up to 1.5ms. On my studio computer the short window is about the same, but the other two push up, with 0-768 taking around 3ms on average.

I was massaging things to see if I could bring it down even more, but I want to just get something working and then improve from there. Plus, with new tools coming out, I may end up with a different approach anyways.

////////////////////////////////////////////////////////////////////////////////////////////////////

Now comes the equally confusing step of querying bits to play back, and then stitching them together. Thanks to some help from @jamesbradbury I have an idea of how to handle the (tight) playback in fl.land~, but before it gets to that I’m going to try to see about getting “better” matching by querying multiple time frames to get a single result. At the moment that would give me an overall latency of 768, but it’d be interesting to see how/if that works. (I’ve not built this yet)
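
I haven’t built this yet, but one way the multi-frame query could collapse into a single result is to sum (optionally weighted) distances per window and take the lowest total. This is only a sketch; the plain squared-difference distance and the weights are assumptions, not what the entrymatcher setup actually does.

```python
import math

def combined_match(database, query_by_window, weights=None):
    """database: list of entry dicts; query_by_window: {'64': {...}, '256': {...}, '768': {...}}"""
    weights = weights or {w: 1.0 for w in query_by_window}
    best, best_score = None, math.inf
    for entry in database:
        score = 0.0
        for window, query in query_by_window.items():
            for stat, value in query.items():
                # entries store stats as e.g. '256_loudness_mean'
                score += weights[window] * (entry[f"{window}_{stat}"] - value) ** 2
        if score < best_score:
            best, best_score = entry, score
    return best
```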

Where it gets a bit puzzly is how to best query this stuff and how to send it to play back.

So 0-64 (let’s call this A for now), I can query for, and start playing back as soon as 512 samples have passed. I can also query, via a second parallel entrymatcher, the nearest match for 0-256 (B).

Even if processing and querying were instant, I’d need some kind of overlap between the two, so A will actually play back something like 128 samples, with B starting off with a fade-in and playing a bit long as well.

For 0-768 I would do the same, with a 3rd entrymatcher (to avoid potential crosstalk), just a bit later.
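
Just to pin the overlap idea down, here is a tiny sketch of the kind of schedule I mean. Only the ~128-sample A fragment and the fade-in/overlap idea come from above; the other offsets and fade lengths are placeholder numbers.

```python
FRAGMENTS = [
    # (name, output start, length, fade-in, fade-out), all in samples
    ("A", 0,   128, 0,  32),   # 0-64 match, played ~128 samples long
    ("B", 96,  256, 32, 64),   # 0-256 match, fades in under A's tail
    ("C", 288, 512, 64, 64),   # 0-768 match, a bit later, overlaps B's tail
]

def gain_at(fragment, sample):
    """Linear crossfade gain for one fragment at an absolute output sample."""
    name, start, length, fade_in, fade_out = fragment
    pos = sample - start
    if pos < 0 or pos >= length:
        return 0.0
    if fade_in and pos < fade_in:
        return pos / fade_in
    if fade_out and pos >= length - fade_out:
        return (length - pos) / fade_out
    return 1.0
```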

Because the analysis and querying happen in the land of (Max) slop, there’s some wiggle room everywhere, with overlaps needed for blending purposes anyways.

My initial sketch of the fl.land~-based playback presumed that I would know the results of the first couple fragments when triggering the chained process. I’m thinking now that that shouldn’t be the case, but that I should instead trigger each segment of time (A, B, C) completely independently, since waiting for each subsequent step would add an increasing (and unknown) amount of latency.

Whether that means just staying in Max-slop land and eating that extra bit of slop, or waiting an additional ca. 0.67-1ms of latency between segments A and B being analyzed so they could be triggered at the same time (i.e. happen exactly 64 samples apart), is something I can test.

////////////////////////////////////////////////////////////////////////////////////////////////////////////

Once I work out this stuff, I’ll experiment with mapping the smaller analysis windows on larger blocks of audio. (e.g. analyzing realtime input of 0-64 and then using that to query for 0-512 in terms of playback, etc…).

Ok, so I managed to put some of this together. Here’s a demo video.

Hopefully it’s clear from the video, but there are three main playback bits at the bottom, the right-most one is samples 0-64, the middle one is 64-256, and the left-most one is 256 until the end of the file.

There’s a bit of extra played back by each section, along with some fade in/out.

My initial thoughts are that it’s a bit underwhelming. There’s loads to improve here, in terms of analysis/matching, as well as having tighter playback and fades lined up etc…, but with these sounds, being triggered by a snare, the transient of the acoustic snare is over-represented in the sound, so everything sounds super “clacky”. I guess that is correct, so that’s good.

What is more interesting is the microsound-sounding sample playback, where I’m just querying and playing back these tiny fragments. And I also quite like the sound of the first bit and then just the sustain, so you get a kind of “hollow” sounding sample.

I think moving forward with this, it may be something where the real-time input is analyzed at a couple of stages, and then that is used to just query longer samples, rather than matching like-for-like.

And/or doing something where the sections are put together from more decompositions, where the “sustain” would be made up of an arbitrary number of HPSS/NMF’d samples.

The windows are just too short to be meaningful, and something like the 50ms window of @tremblap’s LPT thing is (way) too slow for real-time use.

Also, I think the initial results from the spectral compensation approach are looking more promising, if a bit “slow” at the moment (fingers-crossed for fl.land~ filter magic).

This is fun! And sounds good.

You have entered the land of the problem of the overlap of descriptor space! It is accurate enough to be ‘boring’… which is a great problem to have! You did not get that kind of profile with single windows. So now you can curate if/when/how you select what from here! For instance, if it’s too clicky, from a snare with a fast attack, maybe you want to lower the volume of the resynth of that part, or filter it, or feed it a corpus of smoother stuff, etc.! Have you tried with a corpus of only synths, or only something that does not have obvious real physical percussive sounds in it? Matching A to A will give you A, so if you have a good matcher, it’s time to feed it challenging spaces to match :slight_smile:

Yeah, there’s some cool bits. I guess the ‘uncanny valley’-type sounds are more interesting, for now, than trying to resynthesize “real” sounds.

I played with this a little bit. In each of those patch clusters there’s amplitude compensation going on too, so whatever grain is matched, I adjust its loudness to match the corresponding analysis window. For the 64 sample slice, it was getting turned WAY up, which makes sense since most of the energy in a snare hit is in that first little nugget, but the sound that was being played back was overly biased to that CLACK sound. So in the video the first two fragments (0-64, 64-256) have their gain set to 1.0, and the longer bit still does the loudness compensation.
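
The compensation itself is something along these lines (a minimal sketch, assuming the loudness values are in dB; the gain clamp and the exact numbers are just placeholders):

```python
def compensation_gain(input_loudness_db, grain_loudness_db,
                      pinned=False, max_gain_db=24.0):
    """Scale a matched grain so its loudness lines up with the analysis window."""
    if pinned:                       # fragments 0-64 and 64-256 are left at 1.0
        return 1.0
    diff_db = input_loudness_db - grain_loudness_db
    diff_db = max(-max_gain_db, min(max_gain_db, diff_db))  # avoid huge boosts
    return 10.0 ** (diff_db / 20.0)
```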

Not yet. I have a lot of individual samples, but mostly curated from the years where having 32-ish samples was “a lot” (where having 16 sample pads, with a velocity crossover point, was super detailed (lol)). So I have very few sample-library things (does corpus still refer to a body of samples that makes sense holistically?) that have enough samples to work with this approach.

Part of the reason I’ve been inquiring/pushing on the fluid.ampgate~ thing is that I want to do some automated(ish) segmentation in Reaper to more easily break up things I record, without having to spend hours trimming the starts/ends of files.

That is the way we use it: it is the result of the process of making sense of a bunch of samples. In my case, it can change per query :slight_smile:

I hope it helped. What I would do with the current tools is a first pass of ampgate indeed, then maybe a conditional pass of ampslice if the slices are too long. For the former, you can adjust the automatic greediness and other time settings. For the latter, you should be able to eyeball a sort of compromise, depending on your envelope settings, for how many samples you can move each slice earlier… that is what most compressors do (guesstimate a number linked to rise time, mostly half), but here it is your choice… in Reaper you can just offset them all (although I would use Max for batch processing).
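
As a sketch of that two-pass idea (not actual ReaCoMa/Reaper code; the fine_slicer callable and the fixed lookahead-style offset are placeholders):

```python
def refine_slices(slices, max_len, fine_slicer, offset=0):
    """slices: list of (start, end) sample pairs from the first ampgate-style pass."""
    refined = []
    for start, end in slices:
        if end - start <= max_len:
            refined.append((max(0, start - offset), end))
        else:
            # conditional second pass only on overly long slices
            for s, e in fine_slicer(start, end):
                refined.append((max(0, s - offset), e))
    return refined
```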

In that case, yes, I need to make more corpora. It would be great to pick a sound world and just record for a bit, then segment things into useful nuggets.

I need to spend more time with it, but at the moment I struggle to get anything remotely useful as a first step in Reaper, much less multiple/refining passes.

I think it’s also the fact that I’d be playing them back in a percussive context, so the morphology of the sound is quite important (with regards to segmentation) because weird beginnings/ends really stand out in that context.

I think @jamesbradbury had some success yesterday with this, and maybe I should do a quick session complementary to last week’s, going verbose on using your sound and working through it… if James is game, he could do his process first and then I could do mine on your sound. Does that have any appeal?

Yeah that would be great. I’ll put together a short fragment of audio that has a couple of the sounds I’ve been trying to extract.

edit:
Actually, this is problematic enough for me in terms of finding settings (and some of the other recordings are in my studio, which I can’t access at the moment due to (very exciting!) construction).

http://rodrigoconstanzo.com/bucket/saxybits.zip

I’ll give it a shot! I had some serendipitous luck getting a good starting place then tweaking and learning a lot about lookback/fwd.

I was also creating some Plumbutter recordings the other day and was having equally shitty luck with that, but those are more clear-cut percussive sounds, whereas these sax ones are kind of all over the place (though still having humanly perceivable clear segments).

Yes, I would be interested as well.

Okay so I got a good start here with these:

Note: I have already implemented a mode for ampgate in ReaCoMa where it won’t mute the off segments and instead just creates segments like all the other slicers, which might be more useful for this scenario.

That sounds great. Could I test it?

Well, the exact parameters from your screenshot throw this error:

It did work for a while and then stopped the process:

This bug is likely because there are two slices that have the exact same index in scientific notation. Lua actually handles scientific notation fine and will convert it to a number; however, the problem is that the precision of the CLI’s CSV output is low enough that two slices in the million-plus range will truncate to the same string in scientific notation. This then tries to segment on a boundary which is no longer valid. This problem will be fixed when the precision changes on the CLI tools, and actually I’m fairly sure that is the case now. It is also worth updating to the latest development version, where I look for and eliminate any duplicates in the table of slice points once the data is corralled into Lua land.
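
Conceptually the guard looks something like this (sketched in Python rather than the actual ReaCoMa Lua, just to show the de-duplication after parsing):

```python
def parse_slice_points(csv_fields):
    """Parse slice points from the CLI's CSV output, dropping duplicates."""
    points, seen = [], set()
    for field in csv_fields:
        value = int(float(field))   # "1.2345e+06" parses fine, but low precision
        if value not in seen:       # can make two nearby points collide
            seen.add(value)
            points.append(value)
    return sorted(points)
```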

One day, if you are up for it, we should get you set up with git so that you can quickly pull new changes :). I’m not sure what your experience with git or version control is, but it doesn’t necessarily mean going to the command line these days, because there are lots of neat GUI tools. Then you can always be at the bleeding edge (if you want to!). Let me know if that’s interesting to you, or not.

Either way, here is the latest build, which should (though it’s relatively untested) guard against these issues.
https://github.com/jamesb93/ReaCoMa/archive/devel.zip

I literally just fixed a bug Rod found, so if you were speedy and jumped straight into the download, make sure to download again (17:37 UK time).

James, your development and speedy responses are very much appreciated. Let’s have a zoom about git. I used it a few years back.

Let’s! Anyone else interested can join too if that is useful for them. Perhaps we could do it before or just after this week’s FluCoMeet?

After would work for me.