Hybrid/Layered resynthesis (3-way)

Ok, so finally got to building some of this since I’ve had a bunch of time on my hands recently.

The idea is to build on the onset descriptors idea, which I’ve posted elsewhere on the forum, to analyze an incoming bit of audio (generally a percussion/drum attack) and then create a hybrid/layered resynthesis of it: the transient is quickly analyzed and replaced with an extracted corpus transient; then a mid-term window of time is analyzed, crossfading to an appropriate sound/layer; and finally a long-term fadeout which, again, would hopefully be hybrid.

So the general idea is something like @a.harker’s multiconvolve~, where multiple windows of time are analyzed and stitched together as quickly as possible.

At the last plenary, after the concert, I started building something that extracted all the transients from a corpus, in an effort to start with simple “transient replacement”. After some messing around and playing with settings, along with @a.harker’s suggestions, I got something that works and sounds ok, and have built a patch that does this (the not-super-tidy code is attached below).

code.zip (98.5 KB)

The first bit of funky business is that transient extraction (via fluid.buftransients~ at least) is not a time-based process, so even though I am slicing off the first 50ms chunk of audio before running the transient extraction on it, I’m getting back “a little window of clicks”, which represents the extracted transient. It can also end up zero-padded, which I’ve then uzi-stripped out.
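For what it’s worth, the zero-stripping step is simple enough to sketch outside of Max; here’s a rough Python equivalent of the uzi-stripping, just operating on a list of floats (the `threshold` argument is made up, not anything from fluid.buftransients~):

```python
# Rough sketch of stripping zero-padding from an extracted transient buffer,
# equivalent in spirit to the uzi-stripping in the patch.
def trim_zeros(samples, threshold=0.0):
    """Return the slice of `samples` between the first and last values
    whose magnitude exceeds `threshold`."""
    nonzero = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not nonzero:
        return []  # the buffer was all padding
    return samples[nonzero[0]:nonzero[-1] + 1]
```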

So I’ve done that, and then analyzed those little fragments for a load of stuff, with the intention of focusing on loudness, centroid, and flatness (my go-to descriptors for percussion stuff). Then I’ve run the whole thing through the corpus-querying onset descriptor process which I showed in my last plenary talk.

(shit video with my laptop mic (cuz JACK is too confusing(!)))


It sounds… ok. But it is a proof of concept so far.

I think the transients might be too “long” for now, especially if they are meant to be transients transients. This is going with @a.harker’s suggested settings of @blocksize 2048 @order 100 @clumplength 100. There’s also a bit of latency too, as I’m using a 512 sample analysis window (ca. 11ms), whereas I think in context, this would be much smaller to properly stagger and stitch the analysis.

So the first part of this is just sharing that proof of concept, while opening that up to suggestions for improvements and whatnot.

The next bit, however, is to ask what the next two “stages” should be. In my head, it makes sense for the first bit to be a super fast, as-short-a-latency-as-possible “click”, and the final stage should be some kind of hybrid fluid.bufhpss~ cocktail of a tail, where the desired sustain is put together from however many layers of pre-HPSS’d sustains as it takes.

Where I’m struggling to think is what the “middle” stage should be. And what scale of time I should be looking at, particularly if the first stage is of an unfixed and/or unknown duration. Should I just have a slightly larger analysis window and do another HPSS’d frankenstein, or should it be a vanilla AudioGuide-style stacking of samples so it’s more “realistic”?

Is there a technical term for this “middle bit”? (i.e. not the transient, and not the sustain)


And finally, a technical question.

What would be a good way of analyzing and querying for something that will potentially be assembled from various layers and parts? As in, I want to have a database of analyzed fragments, probably as grains and slices, each one broken up via HPSS and perhaps NMF, and then I want to be able to recreate a sample with as many layers of the available files as required (again, à la AudioGuide)? (p.s. I want it to happen in real-time… of course)

Will this be a matter of having some kind of ML-y querying, and/or is this possible with the current/available tools?

This is an interesting implementation of ideas we have definitely bounced around a lot in the last few years, and it seems it comes together in a way that is quite idiosyncratic to your practice, so I’m happy!

Not that I’m aware of. Again, transient, attack, allure: many words that mean many things depending on whom you ask. Schaeffer, Smalley, synth builders, plugin designers all refer to the first elements of a sound object to create a taxonomy.

In my example I shared in the last plenary (APT) you saw I decided to treat 0-50, 50-150, 150-500 ms as my onset windows, because that made sense to me then. I decided to make them equal in value to describe the time series, which again made sense to me.

So for you, the choice could be made by analysing your replacement sound in two periods of time (what you already replace as ‘transient’, plus the middle bit), weighting both in your search for a replacement, and then replacing just the 2nd bit that way. It is a vague subversion of the idea of a Markov chain, because I use the ‘preceding element’ to choose the current one. Does it make sense?
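That “weight the two periods” idea can be sketched as a weighted nearest-neighbour search. This is purely illustrative Python (the period names, descriptor vectors, and weights are all invented, not from anyone’s patch):

```python
# Illustrative sketch of the weighted two-period match. Each target/corpus
# entry maps a period name ('transient', 'middle') to a descriptor vector;
# `weights` says how much each period counts in the search.
def weighted_match(target, corpus, weights):
    def dist(a, b):  # plain Euclidean distance between descriptor vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def score(entry):  # weighted sum of per-period distances
        return sum(w * dist(target[p], entry[p]) for p, w in weights.items())

    return min(corpus, key=score)  # best (lowest-cost) corpus entry
```

Weighting the ‘transient’ period lower than the ‘middle’ one is the Markov-ish bias described above: the preceding element influences, but doesn’t dominate, the choice.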

Again, this depends on what you replace, how, etc. I am exploring various ways now, but usually the costly way of doing stuff in parallel. It is faster to prototype, and then when you find something that sounds good, you can optimise because you know what you use (for instance, just the perc part of the first bit, just the pitch part of the middle, etc.)

I hope this helps!

btw this gave me an idea of an interface that would allow you to make a sort of matrix for such pre-processing… It might take some time but I’ll post it here. Imagine having the ability to define time slots to be considered in a target, and for each slot deciding the weight of various descriptors… watch this space!

Those general ballparks work well, but I may just shrink things up a bit more and treat more of it as sustain. Maybe something like 0-20, 20-100, 100+, although that may be too front-loaded for sounds that open up slowly.

Yeah, that could work. I guess part of this idea is to lean into the ‘hybrid’ approach (was that the term you guys used? I don’t remember the distinction between hybrid and synthetic or whatever else was discussed at the last plenary), but basically having a sound that is made up of layers of decomposed actual sounds (e.g. HPSS, NMF, etc…).

And with latency in mind, I thought that I can get transients down to something very tiny, and analyze/query that super quickly while working on the next bit which is more perceivable.

It could just be that everything after that, in all the time windows, is a frankenstein/hybrid thing, with perhaps different weights being put on different aspects of the sounds (i.e. the “middle bit” having more weight put on the transient/energy-oriented descriptors/statistics, and the sustain having more weight put on the tone/pitch-oriented descriptors/statistics).

There’s also the “apples to apple-shaped oranges” thing, in that I want to be able to query files that are significantly longer than the analyzed audio, so some of the process will also be extrapolating the initial analysis data out to arbitrarily long samples.

Yes please!

There are definitely several aspects of this that I have no idea how to build, or would be incredibly clunky to build with my knowledge (and tools). The pre-processing is definitely a big part of that, in that chunking up files and analyzing (and potentially tagging, since for now the querying will be done manually, rather than via ML, so knowing what’s what would be useful) will be a big faff.

I think that you should analyse it twice: once for super quick replacement (first pass), and still include it in the middle and end bits as a way to give you a better match for the sustain part. You can always weigh it differently to the sustain (consider it less important in the match), but it is valid info maybe?

It is, and I guess that could inform what gets replaced in the medium bit, but the idea would be to analyze the transient and replace that immediately.

There’s also some wiggle room (I would think) in the amount of overlap and fading available between fragments, particularly with the transient extraction which has no fixed length.

That is what I meant. You replace the transient with a fast analysis of the transient, AND you replace the middle bit with an analysis of the transient plus an analysis of the beginning of the middle bit, and you can weigh them separately. The same applies to the sustain, which could come from the weighted analysis of the middle and the beginning of the tail.

Aaah right. I understand. We wouldn’t hear the transient of the middle bit, but it would instead be part of the querying for a suitable middle bit.

Indeed. And you can weigh this ‘influence’ as well, and do the same for the 3rd bit with the first 2 bits of ‘influence’.

So I’ve been playing with the transient stuff again and wanted to come up with a slicing that works for me, with the intention of doing crazy fast stitching.

With the idea of replacing stuff in real-time, having front-loaded short bits makes sense since I can stitch as I go, rather than having to wait 50ms for the first chunk of audio to play. I’m also going with the names of ADSR for now, which lines up well since I’ve got four main slices I’m testing with:

  • attack: 88 samps (1.995465ms)
  • decay: 88-512 (11.61ms)
  • sustain: 512-2205 (50ms)
  • release: 2205-6615 (150ms)

For the numbers, I based things on the size I’ve been getting from fluid.buftransients~ (ca. 88 samples) and what a reasonable amount of latency is for real-time use (512 samples). But it’s fairly arbitrary, as it’s numerically/computationally based rather than based on what would perceptually stitch together well.
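As a quick sanity check of those numbers (assuming 44.1 kHz, which the ms values above imply):

```python
# Samples-to-milliseconds conversion for the ADSR breakpoints above,
# assuming a 44.1 kHz sample rate.
SR = 44100

def samps_to_ms(n, sr=SR):
    return n / sr * 1000.0

# 88 samps ≈ 1.995 ms, 512 ≈ 11.61 ms, 2205 = 50 ms, 6615 = 150 ms
```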

So I built a patch that would stitch random samples together at those breakpoints. Cool, this kind of works. There are a lot of variations you can do with the first three chunks where it still sounds believable (more on this below). The final bit… not so much. Granted, I’m literally making random assemblies, and in context, these should hopefully follow each other in some kind of markov-y way.

I then made it so you can tweak the breakpoints easily and hear the results. Here’s the commented patch:


One thing that strikes me immediately in doing this is that fades are super needed. Yes, these fades.

I know there’s been some talk of this several times, but for use cases like this it’s pretty impossible without some kind of fading/smoothing, particularly as the segments get longer and further into the sample. Surprisingly, the first few bits can be smash-cut, but sustains are too jarring.
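For the record, the kind of fade being asked for here is just a short crossfade at each join. A minimal sketch (equal-power sin/cos gains, operating on plain lists; nothing to do with bufcompose’s actual behaviour):

```python
import math

# Minimal equal-power crossfade between the tail of segment `a` and the
# head of segment `b`, overlapping `fade_len` samples.
def crossfade(a, b, fade_len):
    assert fade_len <= len(a) and fade_len <= len(b)
    out = a[:len(a) - fade_len]
    for i in range(fade_len):
        t = (i + 0.5) / fade_len           # position 0..1 across the fade
        g_out = math.cos(t * math.pi / 2)  # outgoing gain
        g_in = math.sin(t * math.pi / 2)   # incoming gain
        out.append(a[len(a) - fade_len + i] * g_out + b[i] * g_in)
    out.extend(b[fade_len:])
    return out
```

Equal-power rather than linear gains keeps the summed energy roughly constant through the overlap, which matters most for the sustain-to-sustain joins described above.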


So I wanted to share this franken-sample patch as it’s quite handy for testing. My game plan is to play with this with a full tank of gas tomorrow and find some breakpoints that work musically, then do some batch analysis with these breakpoints in mind, and try to do some brutalist smashing of these together.

Doing that in a sample accurate manner is going to be important, so I will likely have to end up in the land of the fl.lib~ for now. I don’t know if there are any playback objects in the pipeline for TB2 as I would think being able to playback bits/layers/chunks/slices/pieces would be central to the querying/matching side of things. It would definitely be useful to be able to do it “all” inside the FluCoMa-verse.

Fades aren’t going to be added to bufcompose in time to make you happy. I would suggest that

  • for real-time, using framelib to assemble the audio itself
  • for offline, using jitter for a much greater range of things you can do to a buffer (although I think the plan is that framelib is to become buffer-capable…)

Ok, taking a look at the framelib stuff for the stitching (and fades too I guess, for real-time stuff).

The offline one was just a mock up really, to see where these breakpoints would make the most sense. Being able to audition with fades is handy, but probably not worth doing a whole workaround thing for now (I came up during the “jitter is a paid extra” era, so I never really internalized doing “normal” shit with jitter).

The fade amounts will be critical too I think, especially to avoid weird artefacts in the first couple of chunks, as they are tiny. Is there a rule-of-thumb minimum fade time to avoid AM-y shit? At the moment my shortest segment is 2ms, so not a lot of wiggle room there in terms of fading.
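One bit of plain arithmetic on that question (not a perceptual rule from anyone here): a symmetric crossfade can’t be longer than half of the shortest segment involved, which caps things hard when a segment is only a couple of ms. A hypothetical helper:

```python
# Hypothetical helper: the longest symmetric crossfade (in ms) that the
# shortest segment in a chain can support, assuming 44.1 kHz.
def max_fade_ms(segment_lengths_samps, sr=44100):
    shortest = min(segment_lengths_samps)
    return (shortest / 2) / sr * 1000.0
```

So an 88-sample segment leaves under 1 ms of fade on each side, which presumably doesn’t leave much room before the fade itself becomes audible as AM.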

It’s a bit of a brain fuck in terms of figuring out how the real-time analysis would relate to the stitched playback.

Even drawing it out I’m not sure I really understand how it would line up temporally.

So at the top is each slice (labelled ADSR), and the consecutive overlapping analysis windows. So analysis window 1, analysis window 2 (made up of segments A and D), analysis window 3 (made up of A, D, and S) etc…

That seems alright, and I guess sensible too, in that the second analysis window would be A+D, rather than just being D.

So playback wise (the lower three bits), nothing can happen until the analysis window of A has happened. So once that has happened, I would play back A, and then carry on playing for a length of time equal to D while fading it out. This particular D would not have been analyzed, and would just be tagging along with the analyzed A.

Once analysis window 2 has happened, I would then play back the second half of D, while fading in, then play all of S and a bit of R while fading out. So again here, the only part that was actually analyzed was the fade in from D. The S and R are along for the ride.

And similarly, when analysis window 3 is done, I would play back the ending of S, then carry on playback from there.

This makes sense in terms of latency and stitching together something that is equal in length to what was analyzed. BUT in drawing it all out (it took me like 5 drawings to arrive at this version!) I’m struck by the fact that even though I’m analyzing large chunks, most of the playback in this model is of audio that has not been analyzed. It is primarily what follows what has been analyzed.
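That “mostly unanalyzed playback” observation can be made concrete with a tiny sketch, using the hypothetical 88/512/2205/6615 breakpoints from above:

```python
# Sketch of the staggered model: each playback step can only start once its
# analysis window (everything up to breakpoints[i]) is done, and what then
# sounds is the *next* region of the sample, which was never analysed.
def staggered_schedule(breakpoints):
    steps = []
    for i in range(len(breakpoints) - 1):
        analysed = (0, breakpoints[i])                 # audio the query saw
        played = (breakpoints[i], breakpoints[i + 1])  # audio that sounds
        steps.append((analysed, played))
    return steps
```

In every step the played region starts exactly where the analysed region ends, which is the “along for the ride” effect described above.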

My intuition tells me that this would probably sound ok, in that it would produce a kind of markov-ian thing where the bit that follows the analyzed bit has a high likelihood of having done so, and if each segment is analyzed in sequence, they would probably sound alright following each other (?!).


One perk to this, theorized, approach would be that the effective latency would be down to the length of the shortest analysis window. In this case, 88 samples. That’s fucking ridiculous.

I know from all the other querying/matching that 512 samples is an ok amount of time to wait. So perhaps I can just “zoom out” this whole thing so that the smallest analysis window is 512 samples, and things zoom out from there. I would still have the same kind of problem with hops and such.


Another approach would be to still stitch together from these tiny ass sounds (so the first playback segment would be 88 samples long), but the whole process is just postponed so it doesn’t start until after 512 samples have passed. So after that point there would be more real-world time that has been analyzed and stitched.

Ugh, this kind of shit is a real brain fuck!

Is there some obvious hop math I’m overlooking here in terms of best practice?

Man, I loaded my modular synth corpus (750 small segments) and played with this and it is really fun! It suffers from the absence of cross-fades, but as they are glitch sounds it is still really fun.

Soon, the code will be on GitHub so PullRequests will be considered, should you want to code it in C++ :slight_smile: Or maybe @pasquetje, @jamesbradbury, or some other CCL whizkid will do it, who knows :wink: For the foreseeable future, our little team of 3 lovely people works on the second toolbox and its documentation…

What I don’t understand, though, is why you want to assemble all of this in a single buffer for real-time use (for now). Why don’t you have a 4-voice player, one for each component? No need for framelib there; just have each analysis point find its best match and play along. You can even use the length / latency of the analysis to actually stitch stuff together!

I can make a drawing if what I just said is not clear, especially since some of these larger-scale multi-resolution analysis ideas come in part from my brain in that other post. Maybe I can try to be clearer…

Yeah quite fun to play with!

As an experiment I tried checking the composited buffer for clarity and then iterating until it was above some threshold, but the correlation between sounding interesting/believable and clarity wasn’t strong enough to bother putting it in for the sharing.

This is mainly to try to figure out where the breakpoints between samples should be, before figuring out the playback thing. (I just had a quick chat with @jamesbradbury, which got me on the path to doing the sequential playback with fades in fl.land~. A bit clunky, since fades may be asymmetrical given the lengths of what comes before/after each segment, but it seems solve-able.)

Yeah that would be nice. I can make sense of what each segment of playback should be, but how that relates to analysis windows, and how those analysis windows relate to “real time” is a brain fuck…

Ok, I thought about it overnight, and I’ll try to make it as clear as I can; feel free to ask for clarification. I have not implemented this in real-time yet, but it has been on my todo list for some time now, as you know… since the grant app, actually :wink:

This is a picture of your metal hit from the other thread. I’ve divided time into 3 windows instead of 4; the principle is the same. For now, we’ll consider the start to be perfectly caught, so time 0 is 0 in the timeline. I will also consider that the best-match answer is immediate for now, because we just need to understand the 3 parallel queries going on. Here is what happens in my model:

  • at time 0 (all numbers in samples), an attack is detected, so the snapshot of the address where we are in my circular buffer is identified. Let’s call it 0 for now as agreed. I send this number in 3 delays~: 400, 1000, and 2200. These numbers are arbitrary and to be explored depending on the LPT ideas you’ve seen in the last plenary, but in effect, they are how you schematise time, not far from ADSR for an envelope. Time groupings. Way too short for me but you want percussive stuff with low latency, so let’s do that. Let’s call them A-B-C.

  • at time 400 (which is the end of my first slot) I will send my matching algorithm a query 400 samples long from 0 in database A, and will play the result right away from its beginning, aka the beginning of the matching sound.

  • at time 1000, I will send my matching algo the query of 1000 from 0 in database B. When I get the query back, I will play the nearest match from 400 in until its end (1000), so I will play the last 600 samples only. Why? Because I can use the first 400 to bias the search, like a Markov chain, but I won’t play it. Actually, this is where it gets fun: I would try both settings, searching for a match for 0-1000 in database B1 and searching from 400-1000 in database B2. They will very likely give me different results, but which one is more interesting will depend on the sounds themselves.

  • at time 2200, I will send my query to match either from 0 (C1), from 400 (C2), or from 1000 (C3), again depending on how much I want to weigh the past in my query. That requires a few more databases. Again, I would play from within the sound, where I actually care about my query.

Now, this is potentially fun, but it has a problem:
there will be no sound out between 800 and 1000! If I start to play a 400 long sound at 400, I’ll be done at 800, by which point I won’t be ready for my 2nd analysis at 1000. The same applies between 1600 and 2200. That is ugly.

So what needs to happen is that you need to make sure that your second window is happening during the playback of the first, and the 3rd during the playback of the 2nd. There are 2 solutions to this: you can either play each window for longer, or you can make sure your window settings are overlapping. I would go for the latter, but again there are 2 sub-solutions: either change how you think your bundling (changing the values to 700/1400/2100 for instance), or, with a bit more thinking, delay the playback of each step so it matches. With the values of 400/1000/2200 you would need to start your first sound at 1200, so it would play

  • from A- start at 1200 playing 0-400 (to 1600)
  • from B- start at 1600 playing 400-1000 (to 2200)
  • from C- start at 2200 playing from 1000 up

so that would need adding 2 cues/delays, one at time 1200 and one at time 1600.
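The arithmetic in that scheme is easy to mechanise. A sketch that reproduces both the silent gaps and the delayed 1200/1600/2200 starts (windows ending at 400/1000/2200, each step playing a chunk the length of its slot; purely a calculator, not an implementation):

```python
# Gaps when each chunk plays as soon as its analysis window ends.
def naive_gaps(ends):
    gaps = []
    prev_end = None
    for i, e in enumerate(ends):
        length = e - (ends[i - 1] if i > 0 else 0)  # this slot's duration
        if prev_end is not None and e > prev_end:
            gaps.append((prev_end, e))              # silence between chunks
        prev_end = e + length
    return gaps

# Delayed starts so each chunk ends exactly when the next begins,
# working backwards from the final window's end.
def chained_starts(ends):
    starts = [None] * len(ends)
    starts[-1] = ends[-1]
    for i in range(len(ends) - 2, -1, -1):
        length = ends[i] - (ends[i - 1] if i > 0 else 0)
        starts[i] = starts[i + 1] - length
    return starts
```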

I hope this helps? Obviously, all of this is only problematic in real-time. Sadly, we can’t see into the future. More importantly, the system would need to consider the query time as well before starting to play, since the returning of the best match would never be instant, and is dependent on database size…

I hope this helps a bit? It at least might help understand why your problem is hard…

Awesome, thanks for the verbose response here.

So the “missing time” thing is what I arrived at in my initial idea and sketches. It’s also tricky to think about as there’s the absolute time, and the relative time (relative to the start of actual playback).

The times themselves are obviously subject to massaging (hence my test patch above, to see what kind of segmentation works out), and my initial choices were heavily biased towards the front of the file (to the point that pitch is useless for the first 2-3 analysis windows). It may be overdone, though; I was banking on having the segmentation start with an extracted transient (hence this stuff), with the subsequent bit perhaps having the transient removed, so they could potentially not even require fades.

I think the overlapping analysis windows is where I was leaning towards, but the math of it was hard for me to conceptualize.

I think, in spirit, I like your last example/suggestion, with the caveat that 1200 (in this case) is significantly too long to wait to start playback, as we’re pushing 20ms+ at that point and you can definitely “feel” that, particularly if the sound feeding the system is very short.

So I guess that, mathematically, the delay between the 2nd and 3rd sample (or the final two, if more than three are used) has to be equal to or smaller than the time between the initial attack and the start of the first sample.

So if I wanted no more than 512 samples between the attack detection and the initial playback, it would have to be something like: 88/256/768. Is that right?
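Working that guess through (under the gapless-chaining scheme from the earlier post, where every earlier chunk has to fit before the final window’s end):

```python
# First-playback latency for gapless chained playback: start from the last
# window's end and subtract each earlier chunk's duration.
def first_playback_start(ends):
    start = ends[-1]
    for i in range(len(ends) - 1, 0, -1):
        start -= ends[i - 1] - (ends[i - 2] if i >= 2 else 0)
    return start
```

With 88/256/768 this comes out to exactly 512 samples of initial latency (and with 400/1000/2200, to the 1200 from the earlier example), so that looks right.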

The potential jitter in querying time will definitely factor in, especially if the database is multiple times bigger due to containing multiple versions of the same sample (HPSS, NMF, transient, etc…). In terms of temporal slices, the query can just be limited to the relevant temporal slices, to avoid extra dimensions of querying.

My, perhaps naive, view of that is that the overlaps can be extended forward a bit so that a potential drop in energy mid-crossfade doesn’t become apparent. Either way these sounds will be synthetic, though it would be interesting to see how well it handled stitching back the same sounds again.



In reality/context, I will probably use these staggered analysis windows to query and play back longer sounds than the initial analysis window (i.e. the first 88 samples would be used to query the first 256 that will be played, then the next window (88-256) would determine what plays from 256-1024, etc…).

It would be handy/musical to query and stitch together short samples, but most of the stuff I will be playing back will be longer than my analysis windows, so it’d be about mapping those two spaces onto each other in as useful/musical a way as possible. My working theory is to take the time series/stats of the short attack and kind of extrapolate that out to a certain extent. There will obviously be a very steep point of diminishing returns with that thinking though.

Actually curious if you (@tremblap) have any thoughts on that aspect of the idea, in terms of mapping short analysis windows onto long playback.

It really depends on what you want to bundle together. 512 samples 3 times in a row would do that:
0: start recording
512: playback of A from 0 to 512
1024: playback of B from 512 to 1024 (having analysed 0-1024 (b1) or 512-1024 (b2))
1536: playback of C from 1024 to 1536

(if you want to give the computer a bit of time to find it all, you’ll need indeed to make it shorter)
I find that writing time like this helps me visualise what is happening at each point. So for instance, with 512 of latency max, and let’s say 128 for retrieving (2 block sizes, this is probably way too much) that looks like this:
0: start recording
512: start playback
Now I need to subtract my query time. God, I hate these powers of 2, so let’s start again with simple numbers: 500 latency max, 100 safety query duration

0: start rec
500: play
400: latest query
500 will play until 900 with confidence

I do the same cycle again:
900: start B playback
so 800 is the latest query. We already covered 0-400, so that would be for 400-800
so playback is 900-1300

1300 start playing C
so 1200 is latest query, to play 800-1200

in effect, your various window sizes have a cascading effect on your overall latency. Put your numbers in there, and see how they behave…
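Here is that cycle with the numbers parameterised (max latency, query safety time, chunk length of latency minus query time), purely as a calculator for trying values:

```python
# Calculator for the cascading cycle above: with max latency `latency` and
# query safety time `query_time`, each step plays a chunk of
# (latency - query_time) samples.
def cascade(latency, query_time, steps):
    chunk = latency - query_time
    out, play_start, analysed_from = [], latency, 0
    for _ in range(steps):
        out.append({
            "play": (play_start, play_start + chunk),
            "query_closes": play_start - query_time,  # latest query moment
            "analysed": (analysed_from, analysed_from + chunk),
        })
        play_start += chunk
        analysed_from += chunk
    return out
```

With 500/100 this reproduces the 500-900 / 900-1300 / 1300-onwards playback and the 400/800/1200 query deadlines written out above.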

You’ve mentioned having multiple concurrent analysis windows which influence what samples get chosen (à la an envelope follower building up a signal which then selects between shorter and longer samples, for example).

This is definitely going to be in the mix, but I think it would fail a bit in fairly simple circumstances. So say I’m mapping a slowish envelope follower on loudness to the inverse of duration, where the busier I’m playing, the shorter the samples I play back, to avoid cluttering things. And if I play slowly, I get longer samples. That makes musical sense, but it leaves out potentially powerful moments where some fast playing stops immediately: it would be great to have a longer sample play back there, but the envelope follower could still be lagging down (relative to the overall latency anyways).

Ok, I guess the relationship is that whatever the initial latency is, it can’t be greater than the distance between subsequent steps (although it can be smaller).

And for this I may try to fudge it more where each section plays back longer than the initial block, so some previous file is fading out for 100 samples or whatever regardless.

It is a tricky thing to think about.

Another thing is trying to figure out what kind of information will be analyzed in each chunk. 512 is long enough for shitty pitch but stuff shorter than that, not so much. And if the first bit is only a(n extracted) transient, then the descriptors used to analyze it can be skewed towards that. Same goes for the longer analysis window, which can rely more heavily on pitch, rather than timbre and even loudness.

What I’ll do, once I figure out some reasonable breakpoints, is do a macro analysis that analyzes each segment for everything, and parse out in the querying what is actually useful to have.

it is, and this is why I go verbose with numbers: better than a rule I forget, is a reasoning I remember :wink:

Yes. This is what you could get for instance from analysing from time 0 all the time (preceding context) but this is only a hunch.

What I did in Sandbox #2 is keep these mappings of immediate/short/long trends as influencers that I could assign as presets. So there was a contrarian preset and a subservient preset, where both cases were explored and recallable on the spot, quickly, so I could meta-play my ‘cyber-partner’, if you know what I mean…