Hybrid/Layered resynthesis (3-way)

Awesome, thanks for the verbose response here.

So the “missing time” thing is what I arrived at in my initial idea and sketches. It’s also tricky to think about, as there’s both the absolute time and the relative time (relative to the start of actual playback).

The times themselves are obviously subject to massaging (hence my test patch above, to see what kind of segmentation works out), and my initial choices were heavily biased towards the front of the file (to the point that pitch is useless for the first 2-3 analysis windows). That may be overdone, but I was banking on the segmentation starting with an extracted transient (hence this stuff), with the subsequent bit perhaps having the transient removed, so they could potentially not even require fades.
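For reference, the kind of front-biased boundaries I was sketching look roughly like this (the starting size, growth factor, and total length here are illustrative, not what the patch actually uses):

```python
def front_biased_bounds(first=88, factor=3, total=4096):
    """Short windows at the attack, growing towards the tail."""
    bounds, edge = [0], first
    while edge < total:
        bounds.append(edge)
        edge = int(edge * factor)
    bounds.append(total)
    return bounds

print(front_biased_bounds())  # [0, 88, 264, 792, 2376, 4096]
```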

I think overlapping analysis windows are what I was leaning towards, but the math of it was hard for me to conceptualize.

I think, in spirit, I like your last example/suggestion, with the caveat that 1200 (in this case) is significantly too long to wait before starting playback: we’re pushing 20ms+ at that point (1200 samples is ~27ms at 44.1kHz), and you can definitely “feel” that, particularly if the sound feeding the system is very short.

So I guess that, mathematically, the gap between the ends of the 2nd and 3rd analysis windows (or the final two, if more than three are used) has to be equal to or smaller than the time between the initial attack and the start of first playback.

So if I wanted no more than 512 samples between the attack detection and initial playback, the boundaries would have to be something like 88/256/768. Is that right?
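Trying to sanity-check that myself: if each successive window has to finish no more than the latency budget after the previous one ends (my assumption of how the scheduling works, not an established formula), a quick check:

```python
# Model (my assumption): attack at t = 0, playback starts at most `budget`
# samples later, and each window's query result must be ready before the
# material it feeds is due -- which reduces to every gap between consecutive
# window ends being <= budget.

def windows_feasible(window_ends, budget):
    """window_ends: cumulative end times (in samples) of the staggered windows."""
    prev = 0
    for end in window_ends:
        if end - prev > budget:
            return False
        prev = end
    return True

print(windows_feasible([88, 256, 768], 512))   # True: gaps 88/168/512 all fit
print(windows_feasible([88, 256, 1200], 512))  # False: 1200 - 256 = 944 > 512
```

By that logic 88/256/768 does work, and it’s exactly the 768 − 256 = 512 gap that sits right at the limit.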

The potential jitter in query time will definitely factor in, especially if the database is several times bigger due to containing multiple versions of the same sample (HPSS, NMF, transient, etc…). In terms of temporal slices, the query can just be limited to the relevant temporal slice, to avoid adding another dimension (or several) to the query.
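Something like one tree per temporal slice, so each query only ever searches its own slice (the feature arrays and names here are hypothetical, just to show the shape of the idea):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_slice_trees(features, slice_ids):
    """Build one KD-tree per temporal slice, so a query for slice k never
    touches entries from other slices (or other decompositions of them)."""
    trees = {}
    for k in np.unique(slice_ids):
        idx = np.where(slice_ids == k)[0]
        trees[k] = (cKDTree(features[idx]), idx)
    return trees

def query_slice(trees, k, query_vec):
    """Nearest neighbour within slice k only; returns a corpus-wide index."""
    tree, idx = trees[k]
    dist, local = tree.query(query_vec)
    return idx[local], dist
```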

My, perhaps naive, view of that is that the overlaps can be extended forward a bit so that a potential drop in energy mid-crossfade doesn’t become apparent. Either way these sounds will be synthetic, though it would be interesting to see how well it handles stitching the same sounds back together.
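Concretely, I’m picturing an equal-power crossfade where the overlap length is the knob you’d extend forward (a numpy sketch; the function name and the cos/sin fade shape are my assumptions, not anything from the existing patch):

```python
import numpy as np

def equal_power_xfade(a, b, overlap):
    """Stitch b onto the tail of a with an equal-power (cos/sin) crossfade.
    Summed power stays roughly constant across the join, so lengthening
    `overlap` shouldn't expose an energy dip mid-fade."""
    t = np.linspace(0.0, np.pi / 2, overlap)
    head = a[:-overlap]
    join = a[-overlap:] * np.cos(t) + b[:overlap] * np.sin(t)
    return np.concatenate([head, join, b[overlap:]])
```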

////////////////////////////////////////////////////////////////////////////////////////////////////////

AAAND

In reality/context, I will probably use these staggered analysis windows to query and play back longer sounds than the initial analysis window (i.e. the first 88 samples would be used to query the first 256 that get played, then the next window (88-256) would determine what plays from 256-1024, etc…).
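As a mapping, something like this (only the first two rows are what I actually described; the third is a made-up continuation of the pattern):

```python
schedule = [
    ((0, 88),    (0, 256)),     # first 88 samples query the first 256 played
    ((88, 256),  (256, 1024)),  # next window drives the next playback span
    ((256, 768), (1024, 4096)), # hypothetical continuation of the pattern
]
for (a0, a1), (p0, p1) in schedule:
    print(f"analyse {a0}-{a1} -> drives playback {p0}-{p1}")
```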

It would be handy/musical to query and stitch together short samples, but most of the stuff I will be playing back will be longer than my analysis windows, so it’d be about mapping those two spaces onto each other in as useful/musical a way as possible. My working theory is to take the time series/stats of the short attack and extrapolate that out to a certain extent. There will obviously be a very steep point of diminishing returns with that approach, though.

I’m actually curious if you (@tremblap) have any thoughts on that aspect of the idea, in terms of mapping short analysis windows onto long playback.