Ello,
I recently followed the YouTube tutorial (Building a 2D Corpus Explorer) for setting up a 2D table of classified slices in Max MSP.
I was wondering what kind of approach I would need to take to analyse live audio input in a way that can be compared against all the slices for similarity. My thinking was that if there is a way to analyse the live audio so that it tells me where it would land in the 2D table, then there is no need to compare it against every single slice.
However, my current 2D corpus is created using bufmfcc, bufstats and bufflatten, then reduced to 2 dimensions using umap and normalised. Recreating this pipeline with the live (non-buf) modules doesn’t look like it’s going to work, especially the normalisation stage.
So I’m not sure how to proceed. I have some ideas on how to try this, but I am very new to FluCoMa and thought I should start by asking here first, as there are many modules and approaches I am not aware of. My end goal would be to have a 2D corpus and to be able to look up points based on their similarity to live audio input.
You could make a pipeline that does this with the live input. Take a look at using these objects, in this order:
fluid.mfcc~
fluid.stats
fluid.list2buf
fluid.umap~ (same one you used for the plot) with transformpoint message
fluid.normalize~ (same one you used for the plot) with transformpoint message
This should give you the 2D (xy) position that corresponds to your live input. I have an example of this but it’s not on this computer; I’ll try to remember to follow up and post it here when I get back to that hard drive.
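If it helps to see the same idea written out, here is a rough SuperCollider sketch of that chain. All the names (~featBuf, ~umapBuf, ~xyBuf, ~umap, ~norm) are made up for the example, and it assumes ~umap and ~norm are the already-fitted FluidUMAP and FluidNormalize from building the corpus:

// rough sketch only, not a drop-in patch: ~umap and ~norm are assumed to be the
// FluidUMAP and FluidNormalize instances that were already fitted on the corpus
~featBuf = Buffer.alloc(s, 13); // rolling MFCC means of the live input (13 coeffs assumed)
~umapBuf = Buffer.alloc(s, 2);  // the 2D point UMAP gives back
~xyBuf = Buffer.alloc(s, 2);    // that point scaled by the corpus normalization

~analysis = {
    var in = SoundIn.ar(0);
    var mfccs = FluidMFCC.kr(in, numCoeffs: 13, startCoeff: 1); // like fluid.mfcc~
    var means = FluidStats.kr(mfccs, 20)[0];                    // like fluid.stats (rolling mean)
    FluidKrToBuf.kr(means, ~featBuf);                           // like fluid.list2buf
    Silent.ar;
}.play;

// whenever you want an xy position for the live input:
~umap.transformPoint(~featBuf, ~umapBuf, {
    ~norm.transformPoint(~umapBuf, ~xyBuf, {
        ~xyBuf.loadToFloatArray(action: { |xy| xy.postln });
    });
});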
//===================
In case it’s useful this example shows how you might find the nearest slice, not by comparing to every other, but by using a KDTree. Might be worth checking out as well.
Let me know how it goes and if you come across more questions on the way!
Thank you, this is basically the setup I was trying to do, but I was getting a bit lost trying to match my channels and settings between them. It’s encouraging to know I was on the right track.
The other thing I wasn’t sure would translate well is the normalisation. With the corpus, is normalisation applied based on the values of all the analysed slices? When you have live audio, would the normalisation not yield a different mapping?
Thank you for that example, I hadn’t seen that before, I’ll def check it out.
Yeah, these are good questions–if you think that the ranges of the data in your live audio analyses will be very similar to the ranges of the data you use in training, then it will be fine to use the fluid.normalize~ that was fit on the training data for scaling the live audio analyses.
If you think the ranges of the data of the live analysis might be quite different, then there are a few things you might try.
First is to not use any scaling, which for some purposes can be quite poor, but with MFCCs usually seems to work pretty well.
You might also try scaling based on pre-determined ranges, if you know ahead of time what they will be (for example, a cello that will only play in the range of MIDI notes 36-60, so you could just normalize both the training data and the live data with scaled_value = (incoming_value - 36) / 24).
Lastly, you could try scaling the training data and live input data separately. This may be desirable if the range of values in your live analyses is much smaller than the range of values in your corpus (for example if you’re interested in matching pitch, but your input is flute sounds and your corpus is a whole orchestra of samples…). Scaling them separately like this will help the “full range” of your input analyses access the “full range” of sounds in your corpus. There are some ways to get a fluid.normalize~ to do a kind of “rolling” normalization, by having it fit on a batch of the last n incoming samples and then use that fitting to transformpoint the next set of incoming points as they come in. You could also just keep track of the maximum and minimum values seen so far and do some math on that (see pseudo code below).
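Something like this (a minimal sketch in SuperCollider, with made-up names) is what I mean by keeping track of the minimum and maximum seen so far:

// running min/max "normalization": track the extremes seen so far and scale by them
~seenMin = inf;
~seenMax = -inf;
~runningNormalize = { |val|
    ~seenMin = min(~seenMin, val);
    ~seenMax = max(~seenMax, val);
    if(~seenMax > ~seenMin){
        (val - ~seenMin) / (~seenMax - ~seenMin) // scaled into 0..1
    }{
        0.5 // only one value seen so far, so just return the middle
    };
};

~runningNormalize.(-23); // -> 0.5 (first value seen)
~runningNormalize.(-40); // -> 0.0
~runningNormalize.(-23); // -> 1.0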
Ah, good point to do an approximate scaling that’s applied the same to both items.
I am taking a slow, step-by-step approach with your advice in mind.
My first hurdle is that fluid.bufmfcc (which I set to work with 1x channel/mono) is returning 3x channels, which from what I understand are 3x MFCC coefficients.
But the fluid.mfcc object outputs 1x channel, even though it has the same input parameters.
I think I get it:
fluid.mfcc is outputting each coefficient as a list in one “channel”, and each list output represents a segment in real time / a block? I’m not sure what settings control this time segment, and assume it’s doing some sort of averaging internally?
fluid.bufmfcc is storing each coefficient per channel, with the value for each coefficient stored per sample of the specified slice.
So going forward, this is why bufstats~ is needed for the buffer workflow: to reduce the entire slice of samples down to one sample representing their average. And for the real-time input I can use fluid.stats with a matching sample/coefficient count as the argument, and @history to smooth it.
At this stage, in theory I have the same information from the buffer and the real-time audio; the only difference is how they are formatted. Say I’m going with 6x coefficients for the MFCC:
the buffer, after bufstats, should result in 6x channels with 1x sample (I set it to output the mean only);
the live input from fluid.mfcc will be 1x channel with 6x samples, and in theory won’t need any statistics done to it, as it’s returning the 6x coefficients for whatever time block it analyses in real time.
So the next node in the buffer pipeline is fluid.bufflatten, which at this stage should put my buffer in the same format as the live input: since bufflatten attaches the channels one after the other, and each channel holds 1x sample, the result should be in the same format now?
For the umap stage, as you said, I can use it for both pipelines. It will map from x amount of “samples” per channel down to 2x samples that I can then scale accordingly for the 2D table, in the same way for live and buffer values, so the find-nearest coordinates I send to kdtree should be accurate.
At first I was really confused, but I think writing out my question has helped a lot already. I’m going to try this out now. Let me know if my approach is correct!
Thank you for your time!
it is outputting one list for every fft frame. it is not doing any internal averaging.
right. it writes into the features buffer one channel for each coefficient requested and one sample (frame) for each fft frame in the slice being analyzed.
yes, this is the right idea. you may find that you don’t necessarily need to match the number of values used for averaging between the real-time and the buffer (non-real-time) analyses. I usually find tweaking @history changes the sound in important ways, but not always in relation to whatever was used for the buffer analysis.
you’re right that fluid.bufflatten is next, but what it should give you is one channel with numcoeffs samples (in your example, 6).
yes! excellent.
My gut says you might want more than 6 MFCCs. Definitely try it and see if it works for you though! I usually start with 13 and might go up from there, I’ve never tried it with 6.
Here’s the patch I was talking about. I hope it helps. Let me know if more questions arise!
Thank you so much for going through my long post, and for sharing your approach, this is most helpful.
Looking at your file, I realise that my step of going through umap and using a 2D table may not be needed, since umap distributes points based on their neighbours/distance to each other in an iterative procedure. If there were some way to apply the same mapping the umap used to create the buffer, and then place the live input based on this mapping, it could work, but I’m not sure if that’s possible.
I tried your patch and it seems to be working well! There were some parts which I didn’t quite get my head around yet, like the usage of transformpoint, but I need to take some proper time off to re-visit this project properly. I’m back to work now, so I don’t have as much time as I’d like to mess with it. But as soon as I do, I’ll be sure to get back on here with updates / more questions.
That was the last piece of the puzzle for me, thank you guys!
I saw that the normalize node also accepts transformpoint as a message, so I can umap and normalise using the same mapping as the corpus, so cool.
Hello everyone! I thought this was a valid place to ask a similar question: how do you apply this logic of live matching of the corpus entries to the audio input in SuperCollider, @tedmoore? In Max, it does almost what I want with the example patch by @rodrigo.constanzo, but I would prefer to stay in SuperCollider. I checked this topic, but could not really get it working. Thanks a lot! Best, Alisa
I have very little experience in SC, but I imagine there are similar “apples to oranges” issues to deal with between the (generally shorter) realtime analysis and the (potentially longer) offline analysis.
Hi @Alisa, I’ve just realized that I don’t know if there is an example that does what you’re asking. This example almost does what you’re asking about.
Here’s some code of mine that does it. I’ve plucked it out of a bunch of other stuff, so sorry it won’t really work right off the bat, but perhaps you can eyeball what it’s doing and modify it to your own purposes. If you get a version cleaned up and share it back here, I’ll be happy to take a look. (Sorry I don’t have the time to clean it up myself right now.)
making the dataset:
// note: ConcatMod is a class from the poster's own codebase (per the caveat above),
// so this won't run as-is without it
~createDataSet = {
    arg server, name = "default_name", sourceFilesArray, noveltySliceThresh = 0.6;
    server.waitForBoot{
        var sourceBuf = Buffer(server);
        var sourceMono;
        var pos = 0;
        var hopSize = 512;
        var minSliceLength;
        var source_indices = Buffer(server);
        var features_buf = Buffer(server); // a buffer for writing the MFCC analyses into
        var stats_buf = Buffer(server); // a buffer for writing the statistical summary of the MFCC analyses into
        var flat_buf = Buffer(server);
        var loud = Buffer(server);
        var scaled = Buffer(server);
        var median_loud = Buffer(server);
        var ds_play_dict = Dictionary.newFrom([
            "data",Dictionary.new,
            "cols",3
        ]);
        var ds_loud = FluidDataSet(server);
        var ds = FluidDataSet(server);
        var tree = FluidKDTree(server,2);

        // concatenate all the source files into one buffer
        sourceFilesArray.do{
            arg path;
            var buf = Buffer.read(server,path);
            path.postln;
            server.sync;
            FluidBufCompose.processBlocking(server,buf,destStartFrame:pos,destination:sourceBuf,action:{buf.free});
            pos = pos + SoundFile.use(path,{arg sf; sf.numFrames});
            server.sync;
        };

        // mix down to mono if needed (summing the first two channels)
        if(sourceBuf.numChannels > 1){
            sourceMono = Buffer(server);
            FluidBufCompose.processBlocking(server,sourceBuf,startChan:0,numChans:1,destination:sourceMono,gain:-6.dbamp);
            FluidBufCompose.processBlocking(server,sourceBuf,startChan:1,numChans:1,destination:sourceMono,gain:-6.dbamp,destGain:1);
        }{
            sourceMono = sourceBuf;
        };

        // slice the mono source with novelty slicing
        minSliceLength = ((0.1 * sourceMono.sampleRate) / hopSize).ceil.asInteger;
        FluidBufNoveltySlice.processBlocking(server,sourceMono,indices:source_indices,algorithm:1,threshold:noveltySliceThresh,minSliceLength:minSliceLength);
        server.sync;

        source_indices.loadToFloatArray(action:{
            arg slices_array;
            if(slices_array[0] != 0){slices_array = [0] ++ slices_array};
            if(slices_array.last != sourceMono.numFrames){slices_array = slices_array ++ [sourceMono.numFrames]};
            // per slice: loudness-weighted MFCC means go into ds, median loudness into ds_loud,
            // start frame and length into ds_play_dict
            slices_array.doAdjacentPairs{
                arg start_frame, end_frame, slice_index;
                var num_frames = end_frame - start_frame;
                var id = "slice-%".format(slice_index);
                "analyzing slice: % / %".format(slice_index + 1,slices_array.size - 1).postln;
                FluidBufLoudness.processBlocking(server,sourceMono,start_frame,num_frames,features:loud,select:[\loudness]);
                FluidBufScale.processBlocking(server,loud,destination:scaled,inputLow:-60,inputHigh:0,clipping:3);
                FluidBufMFCC.processBlocking(server,sourceMono,start_frame,num_frames,features:features_buf,startCoeff:1,numCoeffs:ConcatMod.nMFCCs);
                FluidBufStats.processBlocking(server,features_buf,stats:stats_buf,select:[\mean],weights:scaled);
                FluidBufFlatten.processBlocking(server,stats_buf,destination:flat_buf);
                FluidBufStats.processBlocking(server,loud,stats:median_loud,select:[\mid]);
                ds.addPoint(id,flat_buf);
                ds_loud.addPoint(id,median_loud);
                ds_play_dict["data"][id] = [start_frame, num_frames];
                if((slice_index % 100) == 99){server.sync};
            };
            server.sync;

            // normalize the MFCC dataset in place and fit the KDTree on it
            FluidNormalize(server).fitTransform(ds,ds);
            tree.fit(ds);

            // attach the median loudness to the playback data and write everything to disk
            ds_loud.dump{
                arg dict;
                var ds_play;
                var dir = File.realpath(ConcatMod.class.filenameSymbol).dirname+/+name;
                dict["data"].keysValuesDo{
                    arg k, v;
                    ds_play_dict["data"][k] = ds_play_dict["data"][k] ++ v;
                };
                ds_play = FluidDataSet(server).load(ds_play_dict);
                dir.mkdir;
                tree.write(dir+/+"tree.json");
                ds_play.write(dir+/+"ds_play.json");
                sourceMono.write(dir+/+"audio.wav","wav");
                "done".postln;
                ds.print;
                ds_play.print;
                source_indices.loadToFloatArray(action:{
                    arg indices_array;
                    // post the results so that you can tweak the parameters and get what you want
                    "found % slices".format(indices_array.size-1).postln;
                    "average length: % seconds".format((sourceMono.duration / (indices_array.size-1)).round(0.001)).postln;
                });
            };
        });
    };
};
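For reference, calling the function above might look roughly like this (the server, folder name, paths, and threshold are placeholders, and the ConcatMod class referenced above still needs to be available):

// hypothetical call, assuming the ConcatMod class used above is on the class path
~createDataSet.(
    Server.default,
    "my_corpus",                  // folder the tree/dataset/audio get written into (next to the ConcatMod class file)
    [
        "/path/to/sample-01.wav", // placeholder source files
        "/path/to/sample-02.wav"
    ],
    0.5                           // novelty slice threshold, tweak to taste
);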
the synth:
// note: ModuleTensor, ConcatMod, MinMaxScaler, and PauseFreeGate, as well as the variables
// tree, ds_play, target, outBus_, inBus_, audioBuf, and cb, come from the poster's other code
// (per the caveat above)
synth = {
    arg in_bus, out_bus, pauseGate = 1, gate = 1, sourceMono, c_b, predicting = 1;
    var in = Mix(In.ar(in_bus,ModuleTensor.numChannelsPerBus));
    var mfccs = FluidMFCC.kr(in,startCoeff:1);
    var loudness = FluidLoudness.kr(in,select:[\loudness]);
    var mfccbuf = LocalBuf(ConcatMod.nMFCCs);
    var outbuf = LocalBuf(6);
    var noveltyThresh, trig, vol;
    var neighbour1, neighbour2, starts, nums, ends, phs, sig, sourceloudness;

    // novelty threshold and volume come in on a control bus
    # noveltyThresh, vol = In.kr(c_b,2);
    // trigger whenever the novelty slicer detects an onset in the live input
    trig = A2K.kr(Trig1.ar(FluidNoveltySlice.ar(in,1,threshold:noveltyThresh),0.1));
    // rolling means of the MFCCs over the last 40 frames, scaled, then written into a buffer
    mfccs = FluidStats.kr(mfccs,40)[0];
    mfccs = MinMaxScaler.kr(mfccs,Impulse.kr(15.reciprocal));
    FluidKrToBuf.kr(mfccs,mfccbuf);
    // alternate each onset between two voices
    trig = (PulseCount.kr(trig) + [0,1]) % 2;
    // look up the 2 nearest slices; ds_play supplies [start frame, num frames, median loudness] per slice
    tree.kr(trig * predicting,mfccbuf,outbuf,2,lookupDataSet:ds_play);
    neighbour1 = FluidBufToKr.kr(outbuf,numFrames:3);
    neighbour2 = FluidBufToKr.kr(outbuf,3,3);
    starts = [neighbour1[0],neighbour2[0]];
    nums = [neighbour1[1],neighbour2[1]];
    ends = starts + nums;
    sourceloudness = [neighbour1[2],neighbour2[2]];
    // play the matched slices from the source buffer with a slight random rate variation
    phs = Phasor.ar(trig,TRand.kr(-1.midiratio,1.midiratio,trig),starts,ends,starts);
    sig = BufRd.ar(1,sourceMono,phs,1,4);
    // shape the output by the live loudness, gate out quiet material, limit, and fade with predicting
    sig = sig * loudness.dbamp;
    sig = sig * (loudness > -60).lag(0.03) * (sourceloudness > -60).lag(0.03);
    sig = Limiter.ar(sig * 10.dbamp);
    sig = sig * predicting.lag(0.03);
    sig = sig * PauseFreeGate(gate,pauseGate);
    Out.ar(out_bus,sig);
    nil;
}.play(target,args:[\out_bus,outBus_,\in_bus,inBus_,\sourceMono,audioBuf,\c_b,cb]);
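The synth also expects the results of the dataset-building step to already be loaded; roughly something like this (hypothetical paths and names; in the synth above these appear as plain variables from the surrounding code):

// hypothetical loading of what ~createDataSet wrote to disk
~dir = "/path/to/my_corpus";                      // the folder written by ~createDataSet
~tree = FluidKDTree(s, 2);
~tree.read(~dir +/+ "tree.json");
~ds_play = FluidDataSet(s);
~ds_play.read(~dir +/+ "ds_play.json");
~audioBuf = Buffer.read(s, ~dir +/+ "audio.wav"); // mono source audio the slices index into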