Compare live audio input to existing corpus table coordinate

Ello,
I recently followed the YouTube tutorial (Building a 2D Corpus Explorer) for setting up a 2D table of classified slices in Max MSP.

I was wondering what kind of approach I would need to take to analyse live audio input in a way that can be compared against all the slices for similarity. My thinking was that if there is a way to analyse the live audio so that it yields where it would land in the 2D table, then there is no need to compare it against every single slice.

However, my current 2D corpus is created using bufmfcc, bufstats and bufflatten, then reduced to 2 dimensions with umap and normalised. Recreating this pipeline with the live (non-buf) modules doesn't seem like it's going to work, especially the normalisation stage.

So I'm not sure how to proceed. I have some ideas on how to try this, but I am very new to FluCoMa and thought I should start by asking here first, as there are many modules and approaches I am not aware of. My end goal is to have a 2D corpus and to be able to look up points based on their similarity to live audio input.

Best,
Nik

You could make a pipeline that does this with the live input. Take a look at using these objects, in this order:

fluid.mfcc~
fluid.stats
fluid.list2buf
fluid.umap~ (same one you used for the plot) with transformpoint message
fluid.normalize~ (same one you used for the plot) with transformpoint message

This should give you the 2D (xy) position that corresponds to your live input (there's a rough Python sketch of the idea just below). I have an example of this but it's not on this computer; I'll try to remember to follow up and post it here when I get back to that hard drive.
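If it helps to see the idea outside of Max, here's a rough Python sketch of the same "fit once on the corpus, then transform single live points" logic, using the umap-learn and scikit-learn libraries as stand-ins for fluid.umap~ and fluid.normalize~ (corpus_features and live_feature are just placeholder data):

import numpy as np
from umap import UMAP                        # pip install umap-learn
from sklearn.preprocessing import MinMaxScaler

corpus_features = np.random.rand(200, 7)     # placeholder: one row of MFCC statistics per corpus slice
live_feature = np.random.rand(1, 7)          # placeholder: one row of MFCC statistics from the live input

reducer = UMAP(n_components=2).fit(corpus_features)   # fit the dimension reduction on the corpus only
scaler = MinMaxScaler().fit(reducer.embedding_)       # fit the normaliser on the corpus's 2D points

corpus_2d = scaler.transform(reducer.embedding_)              # the points you plot
live_2d = scaler.transform(reducer.transform(live_feature))   # the "transformpoint" step for live input

The key point is that fitting only ever happens on the corpus; the live analysis is only ever passed through the transform, which is what the transformpoint message does in Max.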

//===================

In case it's useful, this example shows how you might find the nearest slice: not by comparing it against every other slice, but by using a KDTree. Might be worth checking out as well.
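For illustration, the lookup itself boils down to something like this (a rough Python sketch using scipy.spatial.KDTree; corpus_2d and live_2d are placeholders for the normalised positions):

import numpy as np
from scipy.spatial import KDTree

corpus_2d = np.random.rand(200, 2)           # placeholder: normalised xy position of each corpus slice
tree = KDTree(corpus_2d)                     # build the tree once from the corpus

live_2d = np.array([0.4, 0.7])               # placeholder: xy position of the current live input
distance, index = tree.query(live_2d, k=1)   # nearest slice without comparing against every point
print(index)                                 # row number of the closest slice in the corpus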

Let me know how it goes and if you come across more questions on the way!

Hi Ted,

Thank you, this is basically the setup I was trying to do, but I was getting a bit lost trying to match my channels and settings between them. But it's encouraging to know I was on the right track.

The other thing I wasn't sure would translate well is the normalisation. With the corpus, is normalisation applied based on the values of all the analysed slices? When you have live audio, would the normalisation not yield a different mapping?

Thank you for that example, I hadn't seen that before; I'll def check it out.


Yeah, these are good questions. If you think that the ranges of the data in your live audio analyses will be very similar to the ranges of the data you used in training, then it will be fine to use the fluid.normalize~ that was fit on the training data for scaling the live audio analyses.

If you think the ranges of the data of the live analysis might be quite different, then there are a few things you might try.

  • First is to not use any scaling at all, which for some purposes can be quite poor, but with MFCCs usually seems to work pretty well.
  • You might also try scaling based on pre-determined ranges, if you know ahead of time what they will be (for example, a cello that will only play in the range of MIDI notes 36-60, so you could just normalize both the training data and the live data with scaled_value = (incoming_value - 36) / 24).
  • Lastly, you could try scaling the training data and the live input data separately. This may be desirable if the range of values in your live analyses is much smaller than the range of values in your corpus (for example, if you're interested in matching pitch, but your input is flute sounds and your corpus is a whole orchestra of samples…). Scaling them separately like this will help the "full range" of your input analyses access the "full range" of sounds in your corpus. There are some ways to get fluid.normalize~ to do a kind of "rolling" normalization by having it fit on a batch of the last n incoming samples and use that fitting to transformpoint the next set of incoming points as they come in. You could also just keep track of the maximum and minimum values seen so far and do some math on that (see the pseudo code and the sketch below).

pseudo code:

max_val, min_val = float('-inf'), float('inf')   # initialise once, before any input arrives
for incoming_val in incoming_values:              # incoming_values: a stream of values, one per analysis frame
    max_val = max(max_val, incoming_val)          # largest value seen so far
    min_val = min(min_val, incoming_val)          # smallest value seen so far
    scaled_value = (incoming_val - min_val) / (max_val - min_val)   # map into 0..1 (needs max_val > min_val)
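And a rough sketch of the "fit on a batch of the last n points" variant, assuming Python with scikit-learn's MinMaxScaler standing in for fluid.normalize~ (recent_frames, scale_incoming and new_frame are made-up names):

from collections import deque
import numpy as np
from sklearn.preprocessing import MinMaxScaler

recent_frames = deque(maxlen=100)            # keep the last n incoming analysis frames (n = 100 here)

def scale_incoming(new_frame):
    recent_frames.append(new_frame)          # new_frame: e.g. the list of MFCC statistics for this hop
    scaler = MinMaxScaler().fit(np.array(recent_frames))    # re-fit on the recent batch
    return scaler.transform(np.array([new_frame]))[0]       # the "transformpoint" step for the newest frame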

You might check out the Why Scale and Comparing Scalers articles! Once you start playing around with the scalers, don’t forget to check out fluid.standardize~ and fluid.robustscale~ too.

Yep. It's on my to-do list for next week when I reunite with that hard drive!

Let me know if that’s all useful and what other questions you come upon!

Ah, good point to do an approximate scaling that's applied the same to both items.
I am taking a slow step by step approach with your advice in mind.

My first hurdle is that fluid.bufmfcc (which I set to work with 1x channel/mono) is returning 3x channels, which from what I understand are 3x MFCC coefficients.
But the fluid.mfcc object outputs 1x channel, even though it has the same input parameters.
I think I get it:

  • fluid.mfcc is outputting the coefficients as a list in one "channel"; each list output represents a segment in real time / a block? I'm not sure what settings control this time segment, and assume it's doing some sort of averaging internally?
  • fluid.bufmfcc is storing each coefficient per channel, with the value for each coefficient stored per sample of the specified slice.

So going forward, this is why bufstats~ is needed for the buffer workflow: to reduce the entire slice of samples down to one sample representing their average. And for the real-time input I can use fluid.stats with a matching sample/coefficient count as the argument, and @history to smooth it.
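(Roughly what I picture fluid.stats with @history doing, as a quick Python sketch with made-up names, just to check my understanding:)

from collections import deque
import numpy as np

history_size = 10                            # like @history: how many recent frames to average over
recent = deque(maxlen=history_size)

def smooth(mfcc_frame):                      # mfcc_frame: the list of coefficients from fluid.mfcc
    recent.append(mfcc_frame)
    return np.mean(np.array(recent), axis=0) # rolling mean over the last history_size frames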

At this stage, in theory I have the same data from the buffer and the real-time audio; the only difference is how they are formatted. Say I'm going with 6x coefficients for the MFCC:

  • the buffer, after bufstats, should result in 6x channels with 1x sample (I set it to output the mean only)
  • the live input from fluid.mfcc will be 1x channel with 6x samples, and in theory won't need any statistics done to it, as it's returning the 6x coefficients for whatever time block it analyses in real time.

So the next node in the buffer pipeline is fluid.bufflatten, which at this stage should put my buffer in the same format as the live input. Since bufflatten attaches the channels one after the other, and they are each 1x channel and 1x sample, the result should be in the same format now?
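(Here's a rough numpy sketch of the shapes I have in mind, with placeholder numbers, to make sure I'm picturing this right:)

import numpy as np

buffer_mfccs = np.random.rand(6, 40)          # bufmfcc output: 6 coefficient channels x 40 fft frames in the slice
buffer_means = buffer_mfccs.mean(axis=1)      # bufstats (mean only): 6 values, one per channel
buffer_flat = buffer_means.flatten()          # bufflatten: a single 6-value vector

live_frame = np.random.rand(6)                # fluid.mfcc: one list of 6 coefficients per analysis frame
assert buffer_flat.shape == live_frame.shape  # both end up as 6 values going into the same umap/normalize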

For the umap stage, as you said, I can use it for both pipelines: it will map from x amount of "samples" per channel down to 2x samples, which I can then scale for the 2D table in the same way for live and buffer values. So the find-nearest coordinates I send to the kdtree should be accurate.

At first I was really confused, but I think writing out my question has helped a lot already. I'm going to try this out now; let me know if my approach is correct!
Thank you for your time!

Hello @NNenov,

Happy new year.

It is outputting one list for every fft frame. It is not doing any internal averaging.

Right. It writes into the features buffer one channel for each coefficient requested and one sample for each fft frame in the slice being analyzed.

Yes, this is the right idea. You may find that you don't necessarily need to match the number of values used for averaging in the buffer and real-time versions. I usually find tweaking @history changes the sound in important ways, but not always in relation to whatever was used for the buffer analysis.

You're right that fluid.bufflatten is next, but what it should give you is one channel with numcoeffs samples (in your example, 6).

Yes! Excellent.

My gut says you might want more than 6 MFCCs. Definitely try it and see if it works for you though! I usually start with 13 and might go up from there; I've never tried it with 6.

Here's the patch I was talking about. I hope it helps. Let me know if more questions arise!

02 concat with scaler live input.maxpat (65.4 KB)

-T

Happy New Year Ted!

Thank you so much for going through my long post, and for sharing your approach, this is most helpful.
Looking at your file I realise that my step of going through umap and using a 2D table may not be needed, since umap distributes points based on their neighbours/distance to each other in an iterative procedure. If there were some way to apply the same mapping umap used to create the buffer, and then consider the live input based on this mapping, it could work, but I'm not sure if that's possible.

I tried your patch and it seems to be working well! There were some parts I didn't quite get my head around yet, like the usage of transformpoint, but I need to take some proper time to revisit this project properly. I'm back at work now so I don't have as much time as I'd like to mess with it. But as soon as I do, I'll be sure to get back on here with updates / more questions.

Thanks again,
Nik

This sounds like what UMAP's transformpoint message does (see the fluid.umap~ reference).


Indeed, @NNenov, as @weefuzzy says, I think fluid.umap~'s transformpoint message is what you're looking for; let us know if that does it!


That was the last piece of the puzzle for me, thank you guys!
I saw that the normalize node also accepts transformpoint as a message, so I can umap and normalise using the same mapping as the corpus. So cool.
