Analyzing, labeling, and composing field recordings - A sketch idea

Hey there, newish FluCoMa user here.

First of all, I would like to start by saying how impressed I am with this package. I had been working with MuBu for some time on this project and I was finding it very difficult to wrap my head around it and to actually implement what I wanted to do. Finding out about FluCoMa made everything easier, so thanks a lot to everyone involved! Also, the documentation provided is impressive and useful, and the integration of the package into the main “host” is really well done (with MuBu I often felt like I needed to re-learn how to patch, while with FluCoMa it feels like I’m just using Max/MSP).

Now, I would like to take this opportunity to describe the project I’m currently working on, both because I have quite a few (actually, too many) ideas for it and I’m struggling with the implementation, and because I’d like to get feedback about it and (hopefully) spur an interesting discussion!

So, the main idea is to have a patch that can analyze raw field recording files, slicing and labeling different parts of them in terms of “compositional” elements (so I’m not really thinking about “dog barking” or “car engine”, but more like “hits”, “gestures”, “textures”, and so on). I would then use this data to train a neural network that, once trained, can automatically classify any new file I add. Then, I would have some sort of algorithmic compositional strategy determining how to treat each sound category, in order to get a final piece stemming from the different field recordings. So far so good, and it seems like FluCoMa has all the tools I need to implement this. In fact, it seems there’s basically already a tutorial on how to do that!

I have actually started out from the “Classifying sounds with a neural network” help file. It works pretty well with simple, easily discernible sounds (such as the oboe or trombone used in the example), but once I feed it a raw 20-minute field recording things get messy. The first issue I have encountered is that it’s hard to get slice points that make sense. I have tried a few objects and parameters, but the main idea I came up with is to use two (or more) slicing techniques simultaneously and then compare their results (possibly using Javascript code), so that I only accept slices which are present (well, reasonably present…) in more than one buffer. I guess this would be pretty straightforward to implement but I have yet to try it, so I’d appreciate any tips about this! Not to mention that coming up with good descriptors for “musically interesting” real-world sounds is another daunting task (I’m currently thinking of mainly using MFCCs as analysis data but, again, I might combine more than one analysis). Then I would have to reduce the analysis data to a more manageable size (again, plenty of tutorials about that!) and use this new data to train the neural network, right?
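Just to make the “compare two slicers” idea a bit more concrete, here is a rough, untested JavaScript sketch of what I have in mind for a [js] object. It assumes I’ve already dumped the slice points (in samples) from each slicer’s output buffer into two arrays; the function name and the tolerance value are just placeholders I made up:

<pre><code>
// Untested sketch: keep only slice points that two slicers (roughly) agree on.
// slicesA / slicesB are arrays of slice points in samples, dumped from the
// output buffers of two different slicing objects.
// toleranceSamps is how close two points must be to count as "the same" slice.
function intersectSlices(slicesA, slicesB, toleranceSamps) {
    var agreed = [];
    for (var i = 0; i < slicesA.length; i++) {
        for (var j = 0; j < slicesB.length; j++) {
            if (Math.abs(slicesA[i] - slicesB[j]) <= toleranceSamps) {
                // keep the average of the two candidate points
                agreed.push(Math.round((slicesA[i] + slicesB[j]) / 2));
                break;
            }
        }
    }
    return agreed;
}

// Example: the two slicers agree on two of three points
// intersectSlices([10000, 55000, 90000], [10300, 90500], 1024) -> [10150, 90250]
</code></pre>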

And, finally, I would have to come up with the algorithmic part for the composition rules, which is something I haven’t even started to think about yet. I just know I’m not really interested in a mosaicing approach (if I understood that correctly), like “find and reproduce the closest slice to the one that’s currently playing”; I’m imagining something more like “these are the sounds available for this category, apply these processes to them (or to some of them)”.
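In code terms, I imagine the rules ending up as little more than a lookup from category to a list of treatments. Something like this made-up sketch (the category and process names are placeholders, not actual FluCoMa objects):

<pre><code>
// Made-up sketch of per-category rules rather than nearest-neighbour mosaicing.
var rules = {
    hits:     ["transpose", "layer"],
    gestures: ["stretch", "reverse"],
    textures: ["granulate", "crossfade"]
};

// Given a labelled slice, look up which treatments are allowed for it.
function treatmentsFor(slice) {
    return rules[slice.category] || [];
}

// treatmentsFor({ category: "gestures", start: 44100, end: 88200 }) -> ["stretch", "reverse"]
</code></pre>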

Does this make any sense? Is there any obvious flaw in the process I’ve outlined that I’m overlooking? Am I aiming too high? I feel like FluCoMa is just the right tool to do all this, but it’s such a vast field that I periodically get lost in it.

Again, thanks a lot! I’m eager to see what ideas you all will come up with.

Hi @Bota,

Welcome to FluCoMa! Your project sounds like something FluCoMa can definitely help with!

If you can post your patch and point to some specifics, that would be really useful for getting some answers going. Maybe we can start by looking at the slicer part of the patch, since you have some questions about that and it is often the first step in building something like this?

Best,

T

Hey @tedmoore,

thanks for your reply!

You’re right, in the initial post I was thinking more about the general idea, so let’s start with some (super easy) code now. The following is the slicing patch:

<pre><code>
----------begin_max5_patcher----------
1375.3ocyZsraaiCEcs8WgfV6oP7kdzUYPGL6lYVMqJJJjsoSTpdXPQmlNEM
e6C0kRNRI1VzhzNAEvJWY83bO7bevq6OmOyeY0i7ZeuO58YuYy947YyfS0bh
Ys1y7KRebUdZMbY9qpJJ3kR+E5uSxeTBm+eq2kl6Uykxrxaq+nWAWJxV4kzc
gk6Jp1Iy4R3wDzd1soxU2otguJ3qjZXDgS9PvBODh9Al5.KnwBq9z6Ks2T1Z
3MVs79eiE064mU183QMm6Wym27wBG6X+0e9oO4kleakHSdWwBuuwEk775r+i
68Dl0c64Yk7UU6JgmA9b4.BMD3ffX3flCHjivAgWbN3ODUa8R81jkyO60Sb.
EPOyfkSJ8h6J+SoZozqNOakR2e1ZSJoA9rnvdqKG0YvWbm4uqdfmK+wTcGRT
zY3NAWb242yympqnSZXlmPhclmbe8trCiUznpHJANPPvADoOZ0OH4O1x02hu
+9upIHrLs.9B+M46xV+gum9.eSkn3I+C3r6yNrMUntMIW7U08uLm2mQeaHhV
82UhHXWEhX4Norp7rohPBPEIvmXZ2mGmGVlVdq+AU2zqheVvqqSuk+pXXc7q
WUSZ1t+VjcaVYZdspd35oFsPhARAoivwXyzH84kCmNFOAeOUJEiI2atF3hk2
I302Uku1e7ddhzK7pZkSyGwnqxZ+P++gTw9vP827YzWLjb1rQ10Y03zCNfXI
8PeCnmQX.cexFnMnQV57Wm7ek7uqdauJsfN+7xcadNwvSd2nxHHVwGlfv6lr
x0uLGxgIS7IXs.TS+lDbaxCrAr1hivbtqcgiPNJZYCW7jUNbRDreIrtBBJHY
bGdSdUprwqOZsD7k1ykdKg+keX+kNVCes6PR0uGaLuEbxEu93AcbT3k1y2J3
a4J0tfmd90D0NuVTGGLwDBHjy7w0p8Gd7MGhGcUTqYCQ5d3mTPJy+MI8sIU2
a2MOlfrKCNh89q70yS+XbBv51aPz2Es2XH0reFQFHMrsyFD48mzvrd65zFV2
dSzaY6McUvK0CjYh0vYQP0rtZ35zh1VCG4t41Lxd9F37Vtqu1QCzsqu3oJKR
b1l9Fqw19d+ns1NjptQ8h5MFYT.57Kjpa3qseWZbnE86FcwELqx4ohEdciqw
RshtIhNoRXzDkJgNSpLoggo8hqyrvdiGQDWNQ1AomEDNBM0g.coSGreqbiqn
Gss3t5.XjSpCLZTMbWvub0K9EAAD1b9g7gNEWK.ZS158LFWyqUE+SkYUk8tF
5fqoHa81prRY6Kjnxao1HGU6wj38FO63h0bwwEjtDoMCL1.nhg4Lzy50XMvJ
rFaBVwmBqQHMeFBYHiXZHGByW1JnEY.zRNExvn38XYOK5BjEZuTDVPQI8ThM
Fud0EewAJIz.jpEhsKtWDgXnCBZLNlwt3alAPsY9GiAUUWnrE89aKUkDpA3J
9TvJjFpG0EbnMvFrb8psQXMxHrB+z+sg5fkqWteQ5uIDt2KtoWkGr6q7XDTO
c.+Pr12x4Z.SBiFOJBqmMcOCKiivQ1pMILPTRBH8zlfksPiYaHdTGOg5Gh2X
YKznNCZjAPyArFxYPSGTzAMr0cRfBsEZsHCG1GYMVt.YHahBnI6wRuDzNBY3
wRxcJjwRf+6hQBeNcLX3BfQL.XnipyBzaLWSYIANCYLmkRiMHkFy5fSD0YPi
1SnAV1BMhCgFa.zrl0LIkVWxE6dSFjF3jMTfnfLlv5ssavX.vziGHc61G3h5
1mNfI+hz6qftMhW.lYkZSFXJ3Oj0c8D3LohU2kI4qj6D5IV8XndhT9EUp9VJ
2k01mkhMTuRXnKMS2pdap1wgYyL+Wy+e.MLwY3A
-----------end_max5_patcher-----------
</code></pre>

I noticed I had previously missed the fluid.bufnoveltyslice~ object, which has given me better results than the fluid.bufonsetslice~ I had originally used. (I also have a question about that: what is the unit of measurement for its kernel parameter? I understand that by increasing it we’re taking more time into account, but what exactly does it refer to?)

Anyway, in this patch I’m using them both, partly to experiment with different settings and partly because, as I wrote earlier, I’m thinking about combining different slicing methods (even if that might not be necessary after all). The patch is super simple and, after some trial and error with the settings, the slicing precision is pretty amazing. My main concern in this area is that I end up with two different scenarios:

  • When using “clean” recordings (such as those that come with the FluCoMa package) I can get really clean slices, but they tend towards the mosaicing feel I mentioned before: lots of very short sounds which would be great to plot onto a map but which aren’t really what I’m going after (I sketch one possible way of filtering these out right after this list);
  • I could obviously just tweak the parameters to make the slicing less sensitive and get longer slices, but with broader settings, or with real, noisy field recordings, I end up with slices that don’t feel perceptually relevant to me. Especially when there’s a constant background element (e.g. traffic, or cicadas) it’s really hard to get meaningful slices. (Then again, one might of course just consider the whole recording as one long slice of background elements, since that feature is persistent, and label it as such, but I’m sure I would slice and label it differently if I were doing this operation by hand.) All of this might of course simply be because I haven’t yet found the right settings for each recording I’m trying to slice.
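As for the very short slices, one thing I might try (again, untested, and the minimum length is an arbitrary value) is to post-process the slice points in Javascript and simply merge anything shorter than a chosen duration:

<pre><code>
// Untested idea: drop slice points that would create fragments shorter than
// minLengthSamps, so very short slices get absorbed into longer ones.
function enforceMinLength(slicePoints, minLengthSamps) {
    if (slicePoints.length === 0) return [];
    var kept = [slicePoints[0]];
    for (var i = 1; i < slicePoints.length; i++) {
        // only keep a point if it is far enough from the last kept one
        if (slicePoints[i] - kept[kept.length - 1] >= minLengthSamps) {
            kept.push(slicePoints[i]);
        }
    }
    return kept;
}

// enforceMinLength([0, 5000, 6000, 50000, 52000, 120000], 22050)
// -> [0, 50000, 120000]   (anything shorter than ~0.5 s at 44.1 kHz is merged)
</code></pre>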

Anyway, I’ve started again from scratch and I’m trying to do super simple things first, so my next step is going to be setting up the analysis part and the dataset. I’ll post the updated patch as soon as I’ve done it.

In the meantime, thanks again!

Tobia

Hello!

I presume you saw our quite terse reference on the object online?

The unit is hops. We are comparing consecutive frames of analysis (using the chosen descriptor, from raw FFT to MFCC to pitch), but doing a self-similarity matrix on the whole file is not super useful for segmentation, so we use a rolling window, a zoom-in if you prefer, on a smaller subset of ‘locality’ to check ‘local similarity’, with the rationale that a change in some feature is how we perceive change (short-term memory). It is not our idea; the original paper is pointed at on our reference site if you fancy it.

So a kernel of 15 will look at 15 consecutive frames, compute the self-similarity matrix, and then output a single number describing how much change there is in that kernel. That becomes a continuous time series / descriptor that you can observe with the same parameters with FluidNoveltyFeature (I just noticed the plot is not drawing on that page, I’ll fix this soon :slight_smile: )
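If it helps to see the principle spelled out, here is a toy sketch of the idea (a simplification for illustration, not our actual implementation): take the frames inside the kernel, build their self-similarity matrix, and weight it so that similarity within each half counts positively and similarity across the halves counts negatively. The result is high when the second half of the kernel sounds different from the first.

<pre><code>
// Toy illustration of the novelty idea (a simplification, not the FluCoMa code).
// 'frames' is an array of kernelSize consecutive feature vectors (e.g. MFCCs).
function cosineSim(a, b) {
    var dot = 0, na = 0, nb = 0;
    for (var i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
}

function noveltyAt(frames) {
    var k = frames.length;
    var half = Math.floor(k / 2);
    var score = 0;
    for (var i = 0; i < k; i++) {
        for (var j = 0; j < k; j++) {
            var sameHalf = (i < half) === (j < half);
            // +similarity within each half, -similarity across the halves
            score += (sameHalf ? 1 : -1) * cosineSim(frames[i], frames[j]);
        }
    }
    return score / (k * k);
}

// Sliding this over the file, one hop at a time, gives the kind of novelty
// curve that the slicer then thresholds to decide where to cut.
</code></pre>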

As with any descriptor’s time series, I like to draw them against the timeline of the sound, so I can play with parameters and see how they behave in relation to my aural intuition. I recommend doing that here too. You can use the buffer version on 30 seconds, for instance, to see how it all changes and where it might give you what you expect.

Happy slicing!
