Searching for similar sounds in a database - general strategies

Is it possible to use FluCoMa to analyse a source file and search for similar sound files in a sound library? I know it is possible when dealing with small slices of audio, combining descriptors and looking for nearest neighbours (KNN), but will it work with entire audio files, and what are the recommended steps to accomplish it?
Any specific pointers to the FluCoMa documentation besides @jamesbradbury's corpusexplorer tutorial and @tedmoore's notes on Learn FluCoMa (which I am about to read)?

The first and most crucial part (at least for me) is to define the term similarity more precisely. Very roughly, I would say it's the likeness of how two sounds (and their spectral components) evolve over time.
A door bang and a glass crash might have in common that they contain one peak/transient/onset followed by a specific decay time, but differ in peak level, length of decay, which spectral bands are prominent, etc… There are other aspects of the sounds that might make the comparison harder as well: for example, it is most likely that in both sounds (to keep with the door and glass example) the onsets occur at different times (temporal offsets) and might repeat only in one sound…
I am just thinking out loud here, and would be glad to read other thoughts, to get a more precise idea of this approach and possible strategies.



Hi @johannes

Yes, but it’s up to you how to approach this really. Depending on the material, summarising a whole file down into (say) a single set of statistics may lose too much information, especially if the morphology of those files is an important part of what makes them distinctive. One thing that @tremblap has done for this is to segment files into a number of (uniform?) slices and make a single ‘point’ that glues together separate statistical summaries for each section: that way you can account more for how the sound changes over time. I think there might even be an early thread on here about it – hopefully PA can remember where!

hey @weefuzzy
thanks for your help.
Can you please explain what you mean by a single 'point'?

Ah, right – skipped ahead, sorry. So, if you’re doing similarity matching with our stuff then the things that are being matched against are entries in a fluid.dataset~, like in the corpusexplorer example. I was calling those entries points (which is what the interface is as well, with setpoint etc).

The point is that the content of those entries can be whatever you want – it doesn't have to be just a bunch of features (or statistics of features) covering a whole file's duration: you could, say, split your files into three sections, and glue together features for each section into a single list that becomes one dataset entry.
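To make that idea concrete (outside FluCoMa entirely), here is a minimal Python/NumPy sketch: split a file's per-frame feature matrix into sections along time, summarise each section statistically, and concatenate the summaries into the single vector that would become one dataset entry. The function name, section count, and feature matrix are all invented for illustration.

```python
import numpy as np

def file_to_point(features, n_sections=3):
    """Summarise a (frames, n_descriptors) time series into one flat vector:
    split it into n_sections along time, take mean and std of each section,
    and concatenate. Each file then becomes a single 'point'."""
    sections = np.array_split(features, n_sections, axis=0)
    stats = [np.concatenate([s.mean(axis=0), s.std(axis=0)]) for s in sections]
    return np.concatenate(stats)

# e.g. 100 analysis frames of 2 descriptors (say loudness and centroid)
frames = np.random.rand(100, 2)
point = file_to_point(frames)   # 3 sections * 2 stats * 2 descriptors
print(point.shape)              # (12,)
```

Because each section is summarised separately, the resulting vector keeps a coarse picture of how the sound changes over its duration, instead of averaging everything into one set of statistics.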


Ok, thanks for elaborating. This goes a bit over my head 🤯

Sure – for the analysis we don't want averaged values for the whole duration of the files, but a series of values describing how the sound changes over time.

So here @tremblap averages/smoothes the data of every slice (which I guess is rather small, 100–200 ms)?

The query is then to find the file that corresponds to the most similar series of data?
Or how should I imagine that?

I’m destroyed after my gig tonight but I’ll reply tomorrow with loads of details I promise!


Ok, your question is both simple and very deep.

Simply: similarity is in the ear of the beholder. It really depends what sound, in which context, you will find similar. It is always easy to find a sound and itself (or very very near) but as soon as you hit musical reality, you are hitting important questions.

For instance, take a C major piano chord. Which of the 5 items below is nearest:

  1. a C# major piano chord
  2. a C minor piano chord
  3. a C major synth pad
  4. a tom sound
  5. a single C note

Depending on how you analyse the sound in time, you will get a different answer. More importantly, depending on when you use it, all of the above can be right. Now imagine comparing a pitchy tom around C vs a single string C… how would you compare them? So deciding what to use to 'describe' the sound is what will change how near and far things are. Then, the range of those descriptors is important, as you read in @tedmoore's "why scale" article.

The deep answer: we listen to sounds in time. So a snare shot, and a snare shot in a reverb, will have different numbers if we are not careful about how we think about time. @jamesbradbury wrote an article on dealing with time, which I have yet to finish, with the custom idea @weefuzzy is referring to. The code in question was in an early release, so I can dig it out; I am trying to write an article around it. The idea is to make 3 stats per slice per descriptor, as a sort of envelope: for instance, you could bundle the centroid of the first 50ms, then the next 150ms, then the next 300ms of your slice, and create 3 dimensions that way.
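As a rough illustration of that envelope idea (not the actual early-release code mentioned above): given a per-frame centroid track for one slice and an assumed analysis frame rate, take the mean over the first 50 ms, the next 150 ms, and the next 300 ms to get 3 dimensions. The function name and `frame_rate` parameter are invented for the sketch, and it assumes the slice is at least 500 ms long.

```python
import numpy as np

def envelope_descriptor(centroid, frame_rate, windows_ms=(50, 150, 300)):
    """Turn one per-frame centroid track for a slice into 3 dimensions:
    the mean centroid over the first 50 ms, the next 150 ms, and the
    next 300 ms -- a crude envelope of how the slice evolves in time."""
    frames_per_ms = frame_rate / 1000.0
    dims, start = [], 0
    for w in windows_ms:
        stop = start + max(1, int(round(w * frames_per_ms)))
        dims.append(float(np.mean(centroid[start:stop])))
        start = stop
    return dims

# e.g. with a hop size of 512 samples at 44.1 kHz the analysis
# frame rate is roughly 86 frames per second
track = np.random.rand(200) * 5000.0
print(envelope_descriptor(track, 86.13))
```

Each slice then contributes 3 numbers per descriptor instead of 1, so two sounds with the same average centroid but opposite trajectories (rising vs falling) end up far apart.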

Then you can take that concept further: you can make custom descriptors according to how you care about your material in time. For instance, @balintlaczko and I had such a discussion, and he came up with a set of descriptors that best fitted how he listened to his material in a given corpus. That thread is really informative.

If you don’t mind, let’s continue this thread, it will help me help you and others. Let me know what is not clear, and how I can clarify things more.


First, thanks for taking the time to give me a clearer idea about all that.

I understand; I think a list of priorities/weightings of chosen descriptors could help to adjust the query to specific needs and the context at hand. If I remember correctly, Diemo did something similar in CataRT, right?
For example: if I choose pitch as the main descriptor with the highest weighting, your tom and string might match. If I chose loudness instead, since both sounds evolve quite differently over time, they probably won't. With weighting between the descriptors we could decide what's more important for the current query.
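One simple way to implement such weighting (a NumPy sketch of the general idea, not CataRT's or FluCoMa's actual mechanism) is to scale each descriptor dimension before measuring Euclidean distance. The pitch/loudness numbers below are invented to mimic the tom-vs-string example:

```python
import numpy as np

def weighted_knn(query, corpus, weights, k=3):
    """Rank corpus entries by weighted Euclidean distance to the query.
    Scaling each descriptor dimension by a weight before measuring
    distance is one way to say 'pitch matters more than loudness'."""
    w = np.sqrt(np.asarray(weights, dtype=float))
    diffs = (np.asarray(corpus, dtype=float) - np.asarray(query, dtype=float)) * w
    dists = np.linalg.norm(diffs, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# hypothetical entries: [pitch in MIDI, loudness in dB]
query  = [60.0, -6.0]                 # a bowed string on C
corpus = [[60.2, -30.0],              # 0: pitchy tom near C, quiet
          [72.0,  -6.0]]              # 1: bell an octave up, same loudness

print(weighted_knn(query, corpus, [10, 0.01], k=1))  # pitch-heavy: tom wins
print(weighted_knn(query, corpus, [0.01, 10], k=1))  # loudness-heavy: bell wins
```

The same corpus returns different "nearest" sounds depending purely on the weights, which is exactly the point: the weighting encodes what you care about in the current query.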

You mentioned musical context: my question about similarity is mainly based on my work as a film sound editor. It's another context, but of course with a lot of overlap with music.

Anyway, I already mentioned the search for sound effects based on similarity – for example, to replace layout sound effects from the editor. Another, I guess even harder, approach would be to find alternative takes of a specific dialogue line (for example "You ask too many questions here!") because the edited take is too noisy or not intelligible. For now the workflow in both cases is to search a database of sound files by metadata. In the first case it's like searching keywords (also a combination of keywords to narrow the search, e.g. bell, small, damped).
In order to find an alternative dialogue line with the same words, I would search by scene and take, then click through and search manually via the waveform display until I find what I need – mostly something that sounds as close as possible to the original. Depending on the size of the database this can take some time and nerves. So I wonder if FluCoMa could help me simplify my search and return some matches from which I can choose.

If you find time, or once you finish your article, I would like to see how it works.
I'm not quite sure I get the idea you described, or why you do these things… why not simply make smaller slices instead? Why 3 sections – over the duration of an entire sound file, or will there be 1000s of slices with 3 sections each?

Will read the links you mentioned…
Again, thanks for all the input. All the best, j


There is indeed a weighting in CataRT, which you can implement in FluCoMa if you want, but that will not solve your real questions: when to compare sounds, and how to shrink time – do you average per second and compare consecutive seconds? Do you take the maximum of the loudness, and the average of the pitches? Etc. Comparing sounds is a very complex process we do as humans. For machines it is a complex problem, apart from single shots on drums, which are much simpler because they have predictable features.
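A tiny made-up numerical illustration of why that summarisation choice matters: two loudness curves, a dry snare and the same snare with a reverb tail, look identical if you summarise by maximum, but very different if you summarise by mean.

```python
import numpy as np

# Two hypothetical loudness curves (dB per analysis frame): a short dry
# snare, and the same snare hit followed by a long reverb tail.
dry = np.array([-6.0, -12.0, -40.0, -60.0, -60.0, -60.0])
wet = np.array([-6.0, -12.0, -20.0, -28.0, -36.0, -44.0])

# Summary 1: maximum loudness -- identical, the two sounds look the same.
print(dry.max(), wet.max())    # -6.0 -6.0

# Summary 2: mean loudness -- very different, because the reverb tail
# keeps the level up while the dry snare falls away quickly.
print(dry.mean(), wet.mean())
```

Neither summary is "right": which one you want depends on whether the tail matters to you, which is exactly the kind of decision no tool can make for you.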

I suggest 2 ways forward:

  • first, out of FluCoMa-land, try CataRT and AudioStellar on your sounds – both can give you a feeling for what is and isn't working for you in those implementations.

  • then let me know what is not working for you, and we can try to dig into that process together. I really value helping people engage sonically with this question. It is a complex one!