Finding segments that would concatenate well

I’ve had this pipe dream of creating a strategy where you take a group of audio files and build a kind of scoring system for concatenation. The data you get back would tell you that x -> y = 1.0 indicates a smooth, almost inaudible join, and x -> y = 0 indicates a jarring or abrupt one. This might be putting the cart before the horse. I suppose you could just segment so that you have nice clean segments that always start and end at digital silence, but I’m more interested in thinking about models that could be used to arrange sounds into flowing sequences. Perhaps there is also room for a kind of compensation process afterwards, to turn a 0.8 score into a 0.9 through basic processes like crossfading/EQ.
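As a rough illustration of the "compensation" step, here is a minimal sketch of an equal-power crossfade over plain lists of samples. The function name `equal_power_crossfade` and the `overlap` parameter are made up for this example, and the cosine/sine gain curves are just one common choice, not something this thread has settled on.

```python
import math

def equal_power_crossfade(a, b, overlap):
    """Join sample lists a and b, blending the last `overlap` samples
    of a with the first `overlap` samples of b (hypothetical helper)."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter length")
    out = list(a[:len(a) - overlap])          # untouched part of a
    for i in range(overlap):
        t = (i + 0.5) / overlap               # position in the fade, 0..1
        fade_out = math.cos(t * math.pi / 2)  # gain applied to a
        fade_in = math.sin(t * math.pi / 2)   # gain applied to b
        out.append(a[len(a) - overlap + i] * fade_out + b[i] * fade_in)
    out.extend(b[overlap:])                   # untouched part of b
    return out

# Toy usage: two constant "signals" joined with a 4-sample overlap.
joined = equal_power_crossfade([1.0] * 8, [1.0] * 8, overlap=4)
```

Equal-power curves keep the summed energy roughly constant for uncorrelated material. For highly correlated material (like the identical toy signals above) they produce a small bump in level, so a linear fade may suit those joins better.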

One idea I had was to use MFCCs and simply take a distance measurement across the average of the bands in the last n frames of a sound. n could be quite large really. In that case something like kNN could create a traversable tree of segments with compatibility scores.
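A minimal sketch of that idea, under a few assumptions of mine: the MFCC frames are already computed by whatever analysis tool you use and are passed in as lists of per-band values; the tail of x is compared to the head of y; and the distance is squashed into a 0 to 1 score with 1 / (1 + d). The names `mean_frame`, `concat_score` and `best_successors` are hypothetical.

```python
import math

def mean_frame(frames):
    """Average each MFCC band over a list of frames."""
    n = len(frames)
    return [sum(band) / n for band in zip(*frames)]

def concat_score(x, y, n=4):
    """Score x -> y in (0, 1]; 1.0 means the averaged tail of x
    equals the averaged head of y (assumed mapping: 1 / (1 + distance))."""
    tail = mean_frame(x[-n:])
    head = mean_frame(y[:n])
    return 1.0 / (1.0 + math.dist(tail, head))

def best_successors(segments, k=2, n=4):
    """For each named segment, list the k other segments that score highest
    as its successor. Brute force, which is fine for small corpora."""
    table = {}
    for a, frames_a in segments.items():
        scored = [(concat_score(frames_a, frames_b, n), b)
                  for b, frames_b in segments.items() if b != a]
        scored.sort(reverse=True)
        table[a] = [(b, s) for s, b in scored[:k]]
    return table

# Toy corpus: two-band "MFCC" frames per segment.
corpus = {
    "a": [[0.0, 0.0], [1.0, 1.0]],
    "b": [[1.0, 1.0], [5.0, 5.0]],
    "c": [[9.0, 9.0], [9.0, 9.0]],
}
successors = best_successors(corpus, k=1, n=1)
```

Note that the score is directional: a -> b can be high while b -> a is low, so the result is really a directed graph of segments rather than a tree.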

Has anyone tried this or come across research on this kind of problem, as niche as it sounds?

Another idea would be to take the next few actual frames, analyse them as the ‘natural’ following, and find near-matches of that?
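Something like the following could be a starting point for that, again assuming precomputed MFCC frames as lists of band values. The names `natural_following` and `nearest_continuations` and the `cut` parameter (the frame index where the source segment ends) are invented for this sketch.

```python
import math

def mean_frame(frames):
    """Average each MFCC band over a list of frames."""
    n = len(frames)
    return [sum(band) / n for band in zip(*frames)]

def natural_following(source, cut, n=4):
    """Mean of the (up to) n frames that actually follow the cut point."""
    following = source[cut:cut + n]
    if not following:
        raise ValueError("no frames after the cut point")
    return mean_frame(following)

def nearest_continuations(source, cut, candidates, n=4, k=1):
    """Rank candidate segments by how closely their opening frames
    match what really came next in the source file."""
    target = natural_following(source, cut, n)
    ranked = sorted(
        candidates,
        key=lambda name: math.dist(target, mean_frame(candidates[name][:n])),
    )
    return ranked[:k]

# Toy usage: the source is cut after frame 2; "p" starts like the real continuation.
source = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
candidates = {
    "p": [[2.0, 2.0], [3.0, 3.0]],
    "q": [[10.0, 10.0], [10.0, 10.0]],
}
best = nearest_continuations(source, cut=2, candidates=candidates, n=2)
```

This only works for segments cut out of longer files, since a segment that ends at the end of its file has no "natural" following to compare against.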

The whole idea reminds me of Markov chains but my knowledge there is still on the emerging side of things :smile: @weefuzzy has made a few tunes with similar ideas of traversing a corpus so that might help push you in the right direction, reading-wise…
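For what it’s worth, the Markov chain reading could look roughly like this: treat the pairwise concatenation scores as unnormalised transition weights and take a random walk through the corpus. This is only a sketch under my own assumptions. The score table is made up, `transition_probs` and `walk` are hypothetical names, and every row is assumed to contain at least one positive score.

```python
import random

def transition_probs(scores):
    """Normalise each row of pairwise scores so that it sums to 1."""
    probs = {}
    for a, row in scores.items():
        total = sum(row.values())  # assumed > 0 for every row
        probs[a] = {b: s / total for b, s in row.items()}
    return probs

def walk(probs, start, steps, rng=None):
    """Random walk: the next segment is drawn according to the
    transition probabilities of the current one."""
    rng = rng or random.Random()
    path = [start]
    for _ in range(steps):
        row = probs[path[-1]]
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        path.append(nxt)
    return path

# Made-up scores: scores[x][y] is the x -> y concatenation score.
scores = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"a": 1.0, "b": 0.0},
}
path = walk(transition_probs(scores), start="a", steps=5, rng=random.Random(1))
```

Smooth joins become likely without being the only option, so the sequence keeps some variety. Always taking the top-scoring successor instead would tend to get stuck looping between the same few segments.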

Definitely a thing @groma knows more about than I, but I think the idea of concatenation cost is pretty standard in concatenative synthesis. There’s some discussion in this article by Diemo.

You could possibly knock something up to try and find optimal offline paths using the HMM tools in MuBu.

Definitely interested in this too as I more-or-less want to do the same with the hybrid/stitched model I want to do for the future of my corpus-based sampler stuff.

The really tricky part for me, as always, is the latency, since I would (obviously!) want it to be real-time, with no latency. (There was actually a cool post on the lines forum where Ezra (Don Buchla’s son) was talking about negative latency with the electric marimba strikes, due to how it was set up.)

Now that I’ve been cracking on that sampler and kickass onset detection stuff, I’m going to try to revisit and improve the “transient replacement” stuff and then try doing a staggered analysis thing, initially just matching the nearest of each time window, but then trying to do the weighting towards having things that would concatenate nicely based on the before/after samples.

Curious how smooth things can get, with your platonic “1.0” concatenation, particularly for sustained/drone-ish material.