Analyzing, labeling, and composing field recordings - A sketch idea

Hey there, newish FluCoMa user here.

First of all, I would like to start by saying how impressed I am with this package. I had been working with MuBu for some time on this project of mine, and I was finding it very difficult to wrap my head around it and to actually implement what I wanted to do. Finding out about FluCoMa made everything easier, so thanks a lot to everyone involved! Also, the documentation provided is impressive and useful, and the integration of the package into the main “host” is really well done (with MuBu I often felt like I needed to re-learn how to patch, while with FluCoMa it feels like I’m just using Max/MSP).

Now, I would like to take this opportunity to describe the project I’m currently working on, both because I have quite a few (actually, too many) ideas for it and I’m struggling with the implementation, and because I’d like to get feedback about it and (hopefully) spur an interesting discussion!

So, the main idea is to have a patch that can analyze raw field recording files, slicing and labeling different parts of them in terms of “compositional” elements (so I’m not really thinking of categories like “dog barking” or “car engine”, but more like “hits”, “gestures”, “textures”, and so on). I would then use these data to train a neural network that, once trained, can automatically classify any new file I add. Then, I would have some sort of algorithmic compositional strategy dictating how to treat each sound category, so that a final piece emerges from the different field recordings. So far so good, and it seems like FluCoMa has all the tools I need to implement this. In fact, there’s basically already a tutorial on how to do it!

I actually started out from the “Classifying sounds with a neural network” help file. It works pretty well with simple, easily discernible sounds (such as the oboe or trombone used in the example), but once I feed it a raw 20-minute field recording it gets messy. The first issue I encountered is that it’s hard to get slice points that make sense. I have tried a few objects and parameters, and the main idea I came up with is to use two (or more) slicing techniques simultaneously and then compare their results (possibly using JavaScript code), so that I only accept slices which are present (well, reasonably present…) in more than one output buffer. I guess this would be pretty straightforward to implement but I have yet to try, so I’d appreciate any tips! Not to mention that finding good descriptors for “musically interesting” real-world sounds is another daunting task (I’m currently thinking of mainly using MFCCs as analysis data but, again, I might combine more than one analysis). Then I would have to reduce the analysis data to a more manageable size (again, plenty of tutorials about that!) and use this new data to train the neural network, right?
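For what it’s worth, the “keep only slices that more than one slicer agrees on” idea could be sketched very simply in a Max `js` object. This is just a hypothetical illustration: it assumes you have already dumped each slicer’s output buffer into a plain array of sample positions, and the function and variable names are made up here.

```javascript
// Sketch: keep only slice points that appear (within a tolerance)
// in the output of more than one slicer. Slice times are assumed
// to be plain arrays of sample positions, e.g. dumped from each
// slicer's output buffer. All names here are hypothetical.

function consensusSlices(slicesA, slicesB, toleranceSamples) {
    // Keep a point from slicer A only if slicer B also found a
    // point within the tolerance window around it.
    return slicesA.filter(function (a) {
        return slicesB.some(function (b) {
            return Math.abs(a - b) <= toleranceSamples;
        });
    });
}

// Example: onset-based vs novelty-based points, 1024-sample tolerance
var onsetSlices   = [0, 22050, 44100, 99225, 132300];
var noveltySlices = [0, 21900, 46000, 132500];
var kept = consensusSlices(onsetSlices, noveltySlices, 1024);
// kept → [0, 22050, 132300]
```

The tolerance matters: slicers rarely agree to the sample, so something on the order of a hop size (or a few) seems a reasonable starting point.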

And, finally, I would have to come up with the algorithmic part for the composition rules, which is something I haven’t even started to think about yet. I just know I’m not really interested in a mosaicing approach (if I understood that correctly), like “find and play the slice closest to the one that’s currently playing”; I’m after something more like “these are the sounds available for this category, apply these processes to them (or to some of them)”.

Does this make any sense? Is there any obvious flaw in the overall process that I’m overlooking? Am I aiming too high? I feel like FluCoMa is just the right tool for all this, but it’s such a vast field that I periodically get lost in it.

Again, thanks a lot! I’m eager to see what ideas you all will come up with.

Hi @Bota,

Welcome to FluCoMa! Your project sounds like something FluCoMa can definitely help with!

If you can post your patch and point to some specifics, that would be really useful for getting some answers going. Maybe we can start by looking at the slicer part of the patch, since you have some questions about it and it is often a first step in building something like this?



Hey @tedmoore,

thanks for your reply!

You’re right, in the initial post I was thinking more about the general idea, so let’s start with some (super easy) code now. The following is the slicing patch:


I noticed I had previously missed the fluid.bufnoveltyslice~ object, which has given me better results than the fluid.bufonsetslice~ I had originally used. (I also have a question about it: what is the unit of measurement of its kernel parameter? I understand that by increasing it we take more time into account, but what exactly does it refer to?)

Anyway, in this patch I’m using them both to experiment with different settings and because, as I wrote earlier, I’m thinking about combining different slicing methods (though it might not be necessary after all). The patch is super simple and, after some trial and error with the settings, the slicing precision is pretty amazing. My main concern is that I end up with two different scenarios:

  • When using “clean” recordings (such as those coming with the FluCoMa package) I can get really clean slices, but they tend towards the mosaicing feel I mentioned before: lots of very short sounds which would be great to plot onto a map but which aren’t really what I’m going after;
  • I could obviously just tweak the parameters to make the slicing less sensitive and get longer slices, but with broader settings, or with real, noisy field recordings, I end up with slices which don’t feel perceptually relevant to me. Especially when there’s a constant background element (e.g. traffic, or cicadas) it’s really hard to get meaningful slices. (Then again, one might of course just consider the whole recording as one long slice of background, since it has this persistent feature, and label it as such, but I’m sure I would slice and label it differently if I were doing this by hand.) All of this might of course simply mean I have yet to find the right settings for each recording I’m trying to slice.

Anyway, I’ve started again from scratch and I’m trying to do super simple things first, so my next step is going to be setting up the analysis part and the dataset. I’ll post the updated patch as soon as I’ve done it.

In the meantime, thanks again!



I presume you saw our quite terse reference on the object online?

The unit is hops: we compare consecutive frames of analysis (using the chosen descriptor, from raw FFT to MFCC to pitch). Computing a self-similarity matrix over the whole file is not super useful for segmentation, though, so we use a rolling window (a zoom-in, if you prefer) on a smaller subset of ‘locality’ to check ‘local similarity’, with the rationale that a change in some feature over the short term is how we perceive change (short-term memory). It is not our idea; the original paper is linked from our reference site if you fancy it.

So a kernel of 15 will look at 15 consecutive frames, compute the self-similarity matrix, and output a single number describing how much change there is within that kernel. That becomes a continuous time series / descriptor that you can observe with the same parameters via FluidNoveltyFeature (I just noticed the plot is not drawing on that page, I’ll fix this soon :slight_smile: )
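To make the intuition concrete, here is a toy sketch of the idea, not FluCoMa’s actual implementation (the real object builds a 2-D self-similarity matrix and correlates a checkerboard kernel along its diagonal, per the referenced paper). This one-dimensional simplification just measures, at each position, how different the half-kernel of frames before is from the half-kernel after; every name here is hypothetical.

```javascript
// Toy sketch of novelty on a 1-D descriptor series (not FluCoMa's
// exact algorithm): for each frame, compare the mean of the
// half-kernel of frames before it with the mean of the half-kernel
// after it. Peaks in the resulting curve are candidate slice points.

function noveltyCurve(feature, kernel) {
    var half = Math.floor(kernel / 2);
    var out = [];
    for (var i = half; i < feature.length - half; i++) {
        var before = 0, after = 0;
        for (var k = 0; k < half; k++) {
            before += feature[i - half + k]; // frames just before i
            after  += feature[i + k];        // frames from i onward
        }
        // Novelty = absolute difference of the two local means
        out.push(Math.abs(after - before) / half);
    }
    return out;
}

// A step change in the descriptor shows up as a peak at the boundary:
var feat = [0, 0, 0, 0, 1, 1, 1, 1];
var nov = noveltyCurve(feat, 4);
// nov → [0, 0.5, 1, 0.5]  (peak aligns with the 0→1 transition)
```

This also shows why a bigger kernel smooths over short events: the means are taken over more frames, so brief blips contribute less to the curve.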

As with any descriptor time series, I like to draw it against the timeline of the sound, so I can play with parameters and see how it behaves in relation to my aural intuition. I recommend doing that here too. You can use the buffer version on a 30-second excerpt, for instance, to see how it all changes and where it might give you what you expect.

Happy slicing!