Detecting identical or very similar samples in corpus

Hello all! I am working with a large corpus. The audio samples were segmented by a machine-learning process from long audio files, some of which are identical or very similar. As a consequence, the corpus contains lots of identical or almost-identical samples that I would like to filter out.
What kind of analysis should I carry out to identify them? Is there a way to produce and set a “similarity threshold”?
I think the emphasis should be on the samples that are “very similar/almost identical”. For the exactly identical ones, I guess a straightforward MFCC comparison plus a length comparison, or something like that, would do. Even then, as these samples are very noisy (raw data databent into audio), sometimes the MFCC analysis won’t match (am I talking nonsense here? This is something that has come out of my very little experience).
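For the exactly-identical case, you may not even need MFCCs: a content hash of the raw sample data already catches bit-identical files (and implicitly compares length). A minimal numpy sketch, assuming the files are already loaded as sample arrays, which also shows why “almost identical” needs a feature-based comparison instead:

```python
import hashlib
import numpy as np

def sample_fingerprint(samples: np.ndarray) -> str:
    # Exact duplicates: hash the raw bytes of the sample data.
    # Identical content and length -> identical hash.
    return hashlib.sha256(samples.tobytes()).hexdigest()

a = np.sin(np.linspace(0, 1, 1000)).astype(np.float32)
b = a.copy()                 # bit-identical copy
c = a + np.float32(1e-4)     # audibly identical, but not bit-identical

assert sample_fingerprint(a) == sample_fingerprint(b)
assert sample_fingerprint(a) != sample_fingerprint(c)  # hash can't see "almost"
```

So hashing handles the trivial duplicates cheaply, and the MFCC-plus-threshold idea is only needed for the near-identical ones.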

Additional info: samples range from a few ms (150ms) up to a minute.

Any approaches would be welcomed!

Greetings and thank you!


@jamesbradbury’s work with FTIS may be of interest here:

https://phd.jamesbradbury.net/tech/ftis/

This as well:
https://discourse.flucoma.org/t/segmentation-by-clustering

Ultimately you’ll have to decide what criteria (and thresholds) you use for similarity, and I would imagine morphology, duration, and the time series would be important factors, especially if you want to remove sounds that sound identical.

When you say variations in duration, do you mean that there may be a short file that sounds identical to a long file, or is that just the lay of the land in the corpus, and you will be comparing similar-duration files regardless?

Thank you once again, dear Rodrigo! (And sorry for the delay in answering back.) I’m holding James’ work as a Bible right now. But as you said, I may end up having to work out a way of defining this similarity measure myself. As you suggested, a combination of MFCC stats and duration might be the right way to start.

Concerning your question: there are plenty of files with varying durations that “sound the same”. As all of these were automatically segmented from thousands of files, many of which were essentially the same, the segmentation algorithm sliced them almost identically. Their variation, however, ranges from exactly the same length, to a difference of a couple of samples, and finally to some hundreds of ms. It is as if the algorithm had sliced identical information differently across files, or at different points within files.

So, in the end, I think a nice place to start would be to set a double condition:

  1. if the files are similar in MFCC stats, as defined by one threshold,
    and
  2. if the files have similar durations, as defined by another threshold,
    then: erase one of them

At this point, setting the thresholds is the problem. I could do a straightforward comparison of each MFCC-coefficient stat and evaluate the absolute difference, but I don’t really know if MFCCs work this way; it would be a very linear approach. For example, if the difference in every coefficient is less than 1, then I could consider the files similar.
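That double condition could be sketched roughly like this, assuming each file already has a summary-stat vector (e.g. mean MFCCs, however you extract them) and a duration in seconds. The greedy keep/drop loop, the per-coefficient absolute-difference test, and both threshold values are placeholder choices for illustration, not a recommended setting:

```python
import numpy as np

def dedupe(features, durations, feat_thresh=1.0, dur_thresh=0.05):
    """Greedy de-duplication: drop a file if some already-kept file is
    within feat_thresh on every summary-stat coefficient (the 'linear'
    per-coefficient test) AND within dur_thresh seconds in duration."""
    kept = []
    for i in range(len(features)):
        duplicate = False
        for j in kept:
            close_feats = np.all(np.abs(features[i] - features[j]) < feat_thresh)
            close_dur = abs(durations[i] - durations[j]) < dur_thresh
            if close_feats and close_dur:
                duplicate = True
                break
        if not duplicate:
            kept.append(i)
    return kept  # indices of the files to keep

# toy data: file 1 is a near-copy of file 0, file 2 is different
feats = np.array([[0.0, 1.0], [0.1, 1.1], [5.0, 5.0]])
durs = [0.50, 0.51, 0.50]
print(dedupe(feats, durs))  # → [0, 2]
```

A per-coefficient threshold treats every coefficient as equally important; a single distance (e.g. Euclidean, ideally on standardized coefficients) between whole stat vectors is a common alternative that gives one knob instead of many.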

I hope I was explaining myself clearly, haha.

Greetings, Rodrigo, and thank you very much once more!


Ah right, yeah, that makes more sense, and should actually be more straightforward.

The reason I asked about duration is that I wasn’t sure whether you were saying a 500ms sample could be “similar” to one that is 20s long (same contour, but over a longer period). That kind of stuff starts getting a lot hairier.

For duration differences as small as you’re describing, if you are doing summary statistics then it should be irrelevant, as it may be just a couple of frames’ difference in the summary itself.
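To illustrate why a couple of extra frames barely move a summary statistic, here is a toy numpy check, with random frames standing in for real MFCC frames (the 100×13 shape is just an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))        # pretend: 100 MFCC frames, 13 coeffs
extra = rng.normal(size=(2, 13))           # 2 extra frames from a slightly longer slice
frames_longer = np.vstack([frames, extra])

mean_a = frames.mean(axis=0)               # summary stat of the shorter file
mean_b = frames_longer.mean(axis=0)        # summary stat of the longer file
print(np.max(np.abs(mean_a - mean_b)))     # tiny compared to any sane threshold
```

With hundreds of frames in play, two more or fewer only nudge the mean, so near-identical slices of different lengths still land next to each other in summary-stat space.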

Beyond that, something like the dynamic time warping in this object/thread will be handy, but it’s still in the pre-alpha stages. That shouldn’t be necessary for differences as small as you’re suggesting, though.
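For the record, the “same contour over different lengths” comparison that DTW handles can be sketched in plain numpy (a textbook DTW, not the pre-alpha object from that thread), over two feature sequences of different lengths:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dynamic time warping between two feature sequences
    (frames x coeffs), using Euclidean distance between frames."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# same rising contour sampled at two different lengths:
short = np.linspace(0, 1, 10)[:, None]   # 10-frame version
long_ = np.linspace(0, 1, 40)[:, None]   # 40-frame version
print(dtw_distance(short, long_))        # small: the contours align well
```

Because the warping path can stretch one sequence against the other, the same contour at 10 and 40 frames still scores a small distance, which is exactly the case where plain summary stats or frame-by-frame comparison would struggle.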