Searching for similar sounds in a database - general strategies

johannes · November 27, 2022, 12:57pm

Hello,
Is it possible to use FluCoMa to analyse a source file and search for similar sound files in a sound library? I know that it is possible when dealing with small slices of audio, using a combination of descriptors and a kNN search, but will it work with entire audio files, and what are the recommended steps to accomplish it?
Are there any specific pointers to the FluCoMa documentation besides @jamesbradbury's corpusexplorer tutorial and @tedmoore's note on Learn FluCoMa (that I am about to read)?

The first and most crucial part (at least for me) is to define the term similarity more precisely. Very roughly, I would say it is the likeness of how two sounds (and their spectral components) evolve over time.
A door bang and a glass crash might have in common that they contain one peak/transient/onset followed by a specific decay time, but differ in peak level, length of decay, which spectral bands are prominent, etc. There are other aspects of the sounds that might make the comparison harder as well. For example, it is most likely that in both sounds (to stay with the door and glass example) the onsets occur at different times (temporal offsets) and might repeat in only one of the sounds.
I am just thinking out loud here, and would be glad to read other thoughts, to get a more precise idea of this approach and possible strategies.

peace

weefuzzy · November 30, 2022, 10:21am

Hi @johannes

Yes, but it’s up to you how to approach this really. Depending on the material, summarising a whole file down into (say) a single set of statistics may lose too much information, especially if the morphology of those files is an important part of what makes them distinctive. One thing that @tremblap has done for this is to segment files into a number of (uniform?) slices and make a single ‘point’ that glues together separate statistical summaries for each section: that way you can account more for how the sound changes over time. I think there might even be an early thread on here about it – hopefully PA can remember where!

johannes · November 30, 2022, 8:15pm

hey @weefuzzy
thanks for your help.
can you please explain what you mean by a single 'point'?

weefuzzy · December 1, 2022, 8:56am

Ah, right – skipped ahead, sorry. So, if you’re doing similarity matching with our stuff then the things that are being matched against are entries in a fluid.dataset~, like in the corpusexplorer example. I was calling those entries points (which is what the interface is as well, with setpoint etc).

The point is that the content of those entries can be whatever you want – it doesn’t have to be just a bunch of features (or statistics of features) covering a whole file’s duration, but that you could (say) split your files into three sections, and glue together features for each section into a single list that becomes one dataset entry.
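A minimal sketch of that idea in plain Python, outside FluCoMa: one descriptor track is split into three equal sections, each section is summarised, and the summaries are glued into one flat list that would become a single dataset entry. The function names, the choice of mean and standard deviation, and the loudness values are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def section_summary(frames):
    """Mean and standard deviation of one descriptor over a list of frames."""
    return [mean(frames), pstdev(frames)]

def make_point(descriptor_frames, n_sections=3):
    """Split a per-frame descriptor track into equal sections and
    concatenate each section's summary into one flat list."""
    n = len(descriptor_frames)
    if n < n_sections:
        raise ValueError("need at least one frame per section")
    point = []
    for i in range(n_sections):
        start = i * n // n_sections
        end = (i + 1) * n // n_sections
        point.extend(section_summary(descriptor_frames[start:end]))
    return point

# Hypothetical loudness track (dB) of a short percussive sound.
loudness = [-6.0, -9.0, -14.0, -20.0, -27.0, -35.0, -44.0, -52.0, -60.0]
point = make_point(loudness)  # 3 sections x 2 statistics = 6 values
```

With several descriptors, the same function would be applied to each track and the results concatenated, so every file ends up as one list of the same length.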

johannes · December 2, 2022, 10:21pm

Ok, thanks for elaborating. This goes a bit over my head. 🤯

Sure, for the analysis we don't want averaged values for the whole duration of the files, but a series of values describing how the sound changes over time.

So here @tremblap averages/smooths the data of every slice (which I guess is rather small, 100-200 ms)?

The query is then to find the file that corresponds to the most similar series of data?
Or how should I imagine that?

tremblap · December 2, 2022, 10:27pm

I’m destroyed after my gig tonight but I’ll reply tomorrow with loads of details I promise!

tremblap · December 3, 2022, 11:47am

ok your question is both simple and very deep.

Simply: similarity is in the ear of the beholder. It really depends what sound, in which context, you will find similar. It is always easy to find a sound and itself (or very very near) but as soon as you hit musical reality, you are hitting important questions.

For instance, take a C major piano chord. Which of the 5 items below is nearest:

a C# major piano chord
a C minor piano chord
a C major synth pad
a tom sound
a single C note

Depending on how you analyse the sound, in time, you will get a different answer. More importantly, depending on when you use it, all of the above can be right. Now imagine comparing a pitchy tom around C vs a single string C… how would you compare them? So deciding what to use to ‘describe’ the sound is what will change how near and far things are. Then their range is important, as you read in @tedmoore’s why scale article.
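To illustrate why the range matters, here is a small sketch in plain Python (not FluCoMa code) that standardises each descriptor column to zero mean and unit standard deviation. The function name and the example values are assumptions; standardisation is only one of several possible scalings.

```python
from statistics import mean, pstdev

def standardize(points):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant column: avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(p, means, stds)] for p in points]

# Hypothetical entries: [spectral centroid in Hz, loudness in dB]
points = [[500.0, -30.0], [520.0, -6.0], [4000.0, -29.0]]
scaled = standardize(points)
```

On the raw values, the centroid in Hz dominates any distance simply because its numbers are far larger than the dB values. After scaling, both descriptors contribute on a comparable footing.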

The deep answer: we listen to sounds in time. So a snare shot, and a snare shot in a reverb, will have different numbers if we are not careful about how we think about time. @jamesbradbury wrote an article on dealing with time, which I have yet to finish off with the custom idea @weefuzzy is referring to. The code in question, if you want to check it, was in an early release so I can dig it out, but I am trying to write an article around it. The idea is to make 3 stats per slice per descriptor, as a sort of envelope. For instance, you could bundle the centroid of the first 50ms, then the next 150ms, then the next 300ms of your slice, and create 3 dimensions that way.
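The 50/150/300 ms idea could be sketched roughly as follows in plain Python. This is not the code from the early release mentioned above. The function name, the use of the mean as the statistic, the 10 ms hop, and the centroid values are all assumptions.

```python
from statistics import mean

def envelope_stats(track, hop_ms, window_ms=(50, 150, 300)):
    """Mean of a per-frame descriptor over consecutive windows of
    increasing length, measured from the start of the slice."""
    stats, start_ms = [], 0
    for length_ms in window_ms:
        first = start_ms // hop_ms
        last = (start_ms + length_ms) // hop_ms
        frames = track[first:last]
        if not frames:
            raise ValueError("slice is shorter than the requested windows")
        stats.append(mean(frames))
        start_ms += length_ms
    return stats

# Hypothetical centroid track (Hz), one value every 10 ms, 500 ms long:
# a bright attack that settles quickly.
centroid = [4000.0] * 5 + [2000.0] * 15 + [800.0] * 30
dims = envelope_stats(centroid, hop_ms=10)  # one dimension per window
```

The short first window keeps the attack from being averaged away by the much longer tail.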

then, you can take that concept further: you can make custom descriptors according to how you care about your material in time. For instance, @balintlaczko and I had such a discussion, and he came up with a set of descriptors that best fitted how he listened to his material in a given corpus. The thread is really informative here

If you don’t mind, let’s continue this thread, it will help me help you and others. Let me know what is not clear, and how I can clarify things more.

johannes · December 4, 2022, 8:00pm

First, thanks for taking the time to give me a clearer idea about all that.

I understand. I think a list of priorities/weightings for the chosen descriptors could help to adjust the query to special needs and to the context. If I remember correctly, Diemo did something similar in CataRT, right?
For example: if I choose pitch as the main descriptor with the highest weighting, your tom and string might match. If I chose loudness instead, they probably won't, since the two sounds evolve quite differently over time. With weighting between the descriptors we could decide what is more important for the current query.
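A rough sketch of such a weighting in plain Python, assuming the descriptors have already been scaled to comparable ranges. The function names, the weights, and the values are made up for illustration, and this is not how CataRT or FluCoMa implement it.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance where each descriptor's difference is
    multiplied by its weight before squaring."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical, already-scaled entries: [pitch, decay length]
query = [0.50, 0.20]    # pitchy tom around C
string = [0.52, 0.90]   # single string C: same pitch, much longer decay
kick = [0.10, 0.25]     # kick drum: lower pitch, similar decay

def nearest(weights):
    """Name of the candidate closest to the query under the given weights."""
    candidates = {"string": string, "kick": kick}
    return min(candidates, key=lambda k: weighted_distance(query, candidates[k], weights))

pitch_first = nearest([1.0, 0.1])  # pitch matters most -> "string"
decay_first = nearest([0.1, 1.0])  # decay matters most -> "kick"
```

The same query returns a different nearest sound depending only on the weights, which is the adjustment described above.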

You mentioned musical context. My question about similarity is mainly based on my work as a film sound editor. It is another context, but of course with a lot of overlap with music.

Anyway, I already mentioned the search for sound effects based on similarity, for example to replace layout sound effects from the editor. Another, I guess even harder, approach would be to find alternative takes of a specific dialogue line (for example “you ask to much questions here!”) because the edited take is too noisy or not intelligible. For now the workflow in both cases is to search a database of sound files by metadata. In the first case it is a keyword search (also a combination of keywords to narrow the search, e.g. bell, small, damped).
To find an alternative dialogue line with the same words, I would search for scene and take, then click and search manually through the waveform until I find what I need, mostly something that sounds as close as possible to the original. Depending on the size of the database this can take some time and nerves. So I wonder if FluCoMa could help me to simplify my search and return some matches from which I can choose.

If you find the time, or once you finish your article, I would like to see how it works.
I am not quite sure I get the idea you described, or why you do these things… why not simply make smaller slices instead? Why 3 sections: over the duration of an entire sound file, or will there be 1000s of slices, each with 3 sections?

will read the links you mentioned…
again thanks for all the input. all the best. j

tremblap · December 5, 2022, 1:15pm

There is a weighting indeed in CataRT, which you can implement in FluCoMa if you want, but that will not solve your real questions: when to compare sounds and how to shrink time. Do you average per second and compare consecutive seconds? Do you take the maximum of the loudness, and the average of the pitches? Etc. Comparing sounds is a very complex process that we do as humans. For machines, apart from single drum shots, which are much simpler because they have predictable features, it is a complex problem.

I suggest 2 ways forward:

first, out of FluCoMa-land, try CataRT and AudioStellar on your sounds - both can get you somewhere in having a feeling of what is not working in those implementations for you.
then let me know what is not working for you, then we can try to dig in that process together. I really value helping people engage sonically with this question. It is a complex one!

moonpalace · January 7, 2024, 9:50pm

Reviving this old thread because I’m in a very similar situation. I wonder if @johannes had any luck with this.
I’m a sound designer and I work daily with huge sound libraries, with terabytes of sound effects, from a 2-second door slam to a 10-minute field recording in the middle of the tropical jungle. The idea is to have a tool that, starting from an interesting sound I made, can find similar assets in my sound libraries.
I know, similar is a very subjective term, but what if my whole sound bank is analysed with precise descriptors (FFT, MFCC, loudness, transient detection…) and then it’s a matter of feeding in the sound I designed and seeing what comes out of it?
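As a rough illustration of that workflow in plain Python: the bank is analysed once into one descriptor list per file, then a new sound, analysed the same way, is compared against every entry. The file names and values are invented, and a brute-force search like this is only an assumption for small banks; a large library would need an indexed structure such as a k-d tree.

```python
import math

def k_nearest(query, bank, k=3):
    """Return the k bank entries closest to the query, nearest first."""
    ranked = sorted(bank, key=lambda name: math.dist(query, bank[name]))
    return ranked[:k]

# Hypothetical pre-analysed sound bank: file name -> descriptor list.
bank = {
    "door_slam_01.wav": [0.9, 0.1, 0.3],
    "door_slam_02.wav": [0.8, 0.2, 0.3],
    "jungle_ambience.wav": [0.1, 0.9, 0.7],
    "glass_crash.wav": [0.7, 0.1, 0.9],
}

# Descriptors of a new sound that is not in the bank, analysed the same way.
query = [0.88, 0.12, 0.3]
matches = k_nearest(query, bank, k=2)
```

The query never has to be stored in the bank. It only needs the same descriptors, in the same order and with the same scaling, as the analysed files.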
The process here has very little to do with pitch or the musical domain.
Cheers

johannes · January 8, 2024, 12:01am

I haven’t followed the topic since then, but take a look at explorer from audio particles.

tremblap · January 8, 2024, 9:32am

Hello @moonpalace

First question: did you try audiostellar? it is a prepackaged solution that might help you get into the process and know what you are trying to do on a more technical level… a list of what you don’t like about it is already a good start to code the solution, or even to hire someone to do it for you.

That said, @fearne is working on a stand-alone prototype at the moment and might have another proposal - still, starting with audiostellar is a good place.

Hint: don’t start with a terabyte. Start with a representative subsample of the database. Analysis takes time, so being agile at understanding what you want by trial and error is faster if you have a short bank to start with.
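One possible way to draw such a subsample, sketched in plain Python. It assumes the library's top-level folders roughly correspond to sound categories, which may not hold for a real library. The function name and the paths are hypothetical.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath

def subsample(paths, per_folder, seed=0):
    """Pick up to `per_folder` files at random from each top-level folder."""
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    by_folder = defaultdict(list)
    for p in paths:
        by_folder[PurePosixPath(p).parts[0]].append(p)
    picked = []
    for folder in sorted(by_folder):
        files = by_folder[folder]
        picked.extend(rng.sample(files, min(per_folder, len(files))))
    return picked

# Hypothetical library listing with very uneven folder sizes.
paths = (
    [f"doors/slam_{i:02d}.wav" for i in range(40)]
    + [f"ambience/jungle_{i:02d}.wav" for i in range(5)]
    + ["glass/crash_01.wav"]
)
subset = subsample(paths, per_folder=3)
```

Sampling per folder keeps small categories represented, whereas a purely random pick from this listing would mostly return door slams.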

good luck!

fearne · January 8, 2024, 1:40pm

Seconding Audiostellar, it’s definitely the nicest/most user friendly solution in this space that I’ve come across so far. I’ve never tried it with a soundbank bigger than a few gigs, so I’m not sure how well it plays at such scales as @moonpalace is interested in.

moonpalace · January 8, 2024, 5:23pm

Thanks a lot @tremblap and @fearne for the suggestions,
I didn’t know about the existence of AudioStellar. I quickly tested it this afternoon and it seems very cool!
Unfortunately I couldn’t find the feature I’m looking for, which is feeding the tool a sound I like (or have made) and getting back the most similar sounds from a specific soundbank (the sound I use as the query is not in any soundbank and is unknown to the tool). But maybe I need to dig more into it.

fearne · January 8, 2024, 5:58pm

Ah yes, audiostellar doesn’t have something like that unfortunately from the looks of it. I feel like it shouldn’t be too difficult to implement a drag and drop to make the tool analyse one sample and insert it into the existing grid.

From a quick glance at their gitlab issues list, doesn’t seem like that’s ever been requested, might be worth raising it (in fact I might even get around to it myself when I have some time, since this is a feature I’d be interested in also).

moonpalace · January 8, 2024, 8:49pm

Unfortunately my code skills are limited but I will try to figure it out.
I’ve also dropped a request in their forum.