I’ve had this pipe dream of creating a strategy where you can take a group of audio files and build a kind of scoring system for concatenation, such that you get back data where x -> y = 1.0 indicates a smooth, almost inaudible transition and x -> y = 0 indicates a jarring or abrupt one. This might be putting the cart before the horse — I suppose you could just segment so that you have nice clean segments that always start and end at digital silence — but I’m more interested in thinking about models that could be used to arrange sounds into flowing sequences. Perhaps there is also room for a compensation process afterwards, turning a 0.8 score into a 0.9 through basic processes like crossfading/EQ.
One idea I had was to use MFCCs and simply take the distance between the averaged bands over the last n frames of one sound and the first n frames of the next. n could be quite large, really. In that case something like k-NN could create a traversable tree of segments ranked by compatibility.
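A minimal sketch of that tail-to-head idea, using a crude FFT-band feature as a stand-in for real MFCCs (which you’d get from something like librosa.feature.mfcc) so it runs with numpy alone. All function names, frame sizes, and the exp(-d) score mapping here are hypothetical choices, not a tested recipe:

```python
import numpy as np

def frame(signal, size=512, hop=256):
    """Slice a 1-D signal into overlapping frames."""
    n = 1 + max(0, (len(signal) - size) // hop)
    return np.stack([signal[i * hop : i * hop + size] for i in range(n)])

def band_features(frames, n_bands=13):
    """Average log-magnitude spectrum, pooled into coarse bands
    (a rough stand-in for averaged MFCC coefficients)."""
    spec = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_bins)
    mean = np.log1p(spec).mean(axis=0)           # average over the frames
    return np.array([b.mean() for b in np.array_split(mean, n_bands)])

def join_score(x, y, n_frames=8):
    """Compatibility of playing y after x: near 1.0 = smooth, near 0 = jarring."""
    tail = band_features(frame(x)[-n_frames:])   # end of x
    head = band_features(frame(y)[:n_frames])    # start of y
    d = np.linalg.norm(tail - head)
    return float(np.exp(-d))                     # squash distance into (0, 1]

# Toy segments: two bursts at the same pitch, one an octave-stacked leap away
sr = 22050
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
a = np.sin(2 * np.pi * 220 * t)
b = np.sin(2 * np.pi * 220 * t)
c = np.sin(2 * np.pi * 1760 * t)

assert join_score(a, b) > join_score(a, c)  # like spectra should join more smoothly
```

From there, stacking every segment’s tail/head feature vectors into a matrix and feeding them to something like scikit-learn’s NearestNeighbors would give you the traversable compatibility structure — query a segment’s tail against all heads to find its smoothest successors.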
Has anyone tried this, or come across research on this kind of problem, niche as it sounds?