Noise classification for cleaning large corpora

Hello all! I am working on a project greatly inspired by the amazing work of @jamesbradbury and his Mosh and FTIS ideas. Very early on, I have run into some problems and would appreciate any experience/approach/suggestions regarding them.

At the first stage of this project, which is intended to eventually become an electroacoustic piece (hopefully), I am trying to build a large corpus of databent audio. Following James’s mosh, I wrote a very simple Python script that does the same (converting every binary file to a valid .wav). I have tried converting very different files and got some interesting results. However, as is expected, there is also an overwhelming amount of audio noise. I would say the ratio is something like 97% noise and 3% “nice sounds” I would consider using. It could be even less, though.
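The core of the script is basically this (a minimal sketch using the standard-library wave module, assuming 16-bit mono at 44.1 kHz; the real script just loops this over a folder of source files):

```python
# Databending sketch: reinterpret any binary file's bytes as 16-bit PCM
# and wrap them in a valid .wav header. Assumes mono at 44.1 kHz.
import wave
from pathlib import Path

def bend_to_wav(src: Path, dst: Path, sr: int = 44100) -> None:
    raw = src.read_bytes()
    # 16-bit samples need an even number of bytes
    if len(raw) % 2:
        raw = raw[:-1]
    with wave.open(str(dst), "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit
        w.setframerate(sr)
        w.writeframes(raw)     # the raw bytes become the sample data

if __name__ == "__main__":
    bend_to_wav(Path("some_binary_file.dll"), Path("bent.wav"))
```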

In order to build a corpus consisting of these kinds of sounds, I want to filter all these noisy parts out. Because the ratio is so small, I need large amounts of data (just for the sake of testing, I have bent almost 22 GB of files, but I would like to scale it even further).

To achieve this, I patched an analysis loop in Max. I used fluid.bufnovelty with the spectrum algorithm (maybe mfcc would be better?) to slice every file into “differentiated spectrum” chunks. So far, this proved to work just fine. Of course, there are some things that get lost (i.e., “nice sounds” that get sliced together with noisy parts).
Then, I analyse every slice with fluid.bufstft, fluid.bufmfcc and fluid.bufspectralshape (then get the stats of all of them) and store them into different fluid.dataset objects. I am passing “absolute_file_path position_in_samples” as an identifier to keep track of every slice.
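For anyone who prefers Python, the stats step would look roughly like this (librosa as a stand-in for the fluid objects, so only an approximation of what fluid.bufstats gives me; the identifier scheme is the same one I use in the patch):

```python
# Rough Python equivalent of the per-slice analysis
# (librosa standing in for fluid.bufmfcc / fluid.bufspectralshape + fluid.bufstats).
import numpy as np
import librosa

def slice_stats(path: str, start: int, end: int, sr: int = 44100) -> dict:
    y, sr = librosa.load(path, sr=sr, mono=True)
    seg = y[start:end]
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=seg, sr=sr)
    flatness = librosa.feature.spectral_flatness(y=seg)
    feats = np.vstack([mfcc, centroid, flatness])
    # mean and std per descriptor, flattened into one row per slice
    row = np.concatenate([feats.mean(axis=1), feats.std(axis=1)])
    identifier = f"{path} {start}"   # same id scheme as in the Max patch
    return {identifier: row}
```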

I’ve got this far and now the problem arises: how should I classify all the slices into “noise” and “not-noise” so I can filter the noisy ones out?

My first, straightforward approach is to train an MLP for classification and label every slice. Then simply batch-process all the audio files in Python to cut the noisy parts and leave only the good slices. This should dramatically decrease the size of the corpus and also make it more manageable for later use.
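Something like this sketch is what I have in mind (scikit-learn’s MLPClassifier; X would be the per-slice stats exported from the datasets, y the hand labels, 0 = noise, 1 = keep):

```python
# Sketch of the binary noise / not-noise classifier.
# X: per-slice descriptor stats, y: hand labels (0 = noise, 1 = keep).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_noise_classifier(X: np.ndarray, y: np.ndarray):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000))
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf

# later: keep = clf.predict(X_all) == 1, then cut the files accordingly
```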

However, I am skeptical that this is the most effective way to do it. Is there any other approach I could try? What type of analysis should I use to classify noise better? I was using the FFT (as noise is typically described as “even power across the whole spectrum”, I thought this would be more straightforward). Or perhaps there is a totally different strategy someone else would try to get this working.

I’m sharing the .py and the Max patches just in case someone wants to give them a try. I am open to any suggestions, and critiques are very welcome as well!

Thank you very much to all and have a great weekend

Archivo.zip (18.6 KB)

now. this is your biggest challenge: my noise is your music, your noise is my music. in other words, noise is contextual.

If you mean voiced/unvoiced, monophonic pitch/dense pitch, etc., there are dozens of attempts at generalising this, none of which worked for me. What you describe is spectral flatness. A log version would be good - and we have one in fluid.spectralshape. Its average and standard deviation on a slice (but how do you slice?) might get you somewhere… but I think you’ll get a lot by exploring these various parameters (how to slice, how to describe time, how to shrink) and it might help you find a solution that works for a given sound bank at a given musical moment.
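To give an idea of the kind of number I mean, a rough numpy sketch (the exact scaling in fluid.spectralshape may differ):

```python
# Log spectral flatness per frame, then mean/std over a slice.
# Flatness = geometric mean / arithmetic mean of the power spectrum:
# near 0 dB means flat (noisy), very negative means peaky (pitched).
import numpy as np

def log_flatness_stats(frames: np.ndarray, eps: float = 1e-12):
    """frames: (n_frames, n_bins) power spectra of one slice."""
    geo = np.exp(np.mean(np.log(frames + eps), axis=1))
    arith = np.mean(frames, axis=1) + eps
    flatness_db = 10 * np.log10(geo / arith)
    return flatness_db.mean(), flatness_db.std()
```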

Describe what you hear as noise and not-noise, and I might be able to point at specific ways to tackle those questions I have come across. I very much enjoy these explorations.

p

Dear @tremblap, thank you so much for answering. I apologize, I should have explained this “noise” definition better. I attach here a few samples of what I consider the type of noise I want to cut out.

samples.zip (2.2 MB)

Again, I’m sorry: the .zip file contains a “noise definition” and some “nice sounds” I would like to keep (I think this might be what you meant by monophonic pitch/dense pitch?).

And then there are 2 samples of the raw files I get after bending data. This is the kind of file I am feeding the analysis loop with.

And again, thank you!

I sense you may have read this chapter of my thesis already:

https://phd.jamesbradbury.net/projects/reconstruction-error

I tried the MLP approach (in Python in those days) and it worked pretty well, but not that well. I also found the binary classification approach a bit of a dead end in that, yes, okay, I got rid of some noise, but I still had an unruly amount of stuff to work with at that point, so I hadn’t really made my life any easier. Also, I ended up finding “noise”-classified sounds which manifested as something compositionally fruitful (to my ears at least). That’s this track, “sys.ji”. If I had continued to remove things in an attempt to only be left with “interesting” bits then I wouldn’t have made that track.

So I suppose buried amongst the previous paragraph is maybe an encouragement to think about how you might analyse the noise alongside the non-noise (however you may define that) and then figure out a way that you can actually parse that analysis (visually/aurally) very fast. For me this was the descriptors → stats → dimension reduction → visual mapping pipeline, along with some clustering. Rapidly auditioning clusters with a Max patch put me into a state of compositional readiness probably two orders of magnitude faster than trying to design the perfect system that could target “non-noise”. Again, I write about this in detail in the thesis chapter, but maybe you’ve already read it. Hope that adds some texture if you have.
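In Python, the back half of that pipeline is roughly this sketch (umap-learn and scikit-learn as stand-ins for fluid.umap / fluid.kmeans; the numbers are only starting points to tweak):

```python
# Dimension reduction + clustering for rapid auditioning, not classification.
# X: (n_slices, n_features) matrix of descriptor stats.
import numpy as np
import umap                      # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def map_and_cluster(X: np.ndarray, n_clusters: int = 20):
    X = StandardScaler().fit_transform(X)
    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                          n_components=2).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    return embedding, labels     # plot the embedding, audition one cluster at a time
```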


Oh, and also think about how you can easily get rid of files to reduce the processing time of all your analysis. Some things to think about (there’s a rough sketch after the list):

  1. Too short
  2. Too long
  3. Too loud
  4. Too quiet
  5. Too consistent
  6. Not consistent enough

That gave me a 50% reduction in my experience :slight_smile:
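Something as simple as this kind of pass is what I mean (a rough sketch with soundfile; the thresholds and folder name are placeholders, and I’ve left the consistency checks out here):

```python
# Pre-filter pass: drop files before doing any expensive analysis.
# Thresholds below are placeholders -- tune them to your material.
import numpy as np
import soundfile as sf
from pathlib import Path

def worth_keeping(path: Path,
                  min_dur=0.1, max_dur=30.0,          # seconds
                  min_rms_db=-60.0, max_rms_db=-1.0) -> bool:
    data, sr = sf.read(str(path))
    if data.ndim > 1:
        data = data.mean(axis=1)                      # mixdown to mono
    dur = len(data) / sr
    if not (min_dur <= dur <= max_dur):
        return False                                  # too short / too long
    rms_db = 20 * np.log10(np.sqrt(np.mean(data ** 2)) + 1e-12)
    if not (min_rms_db <= rms_db <= max_rms_db):
        return False                                  # too quiet / too loud
    return True                                       # consistency checks left out here

survivors = [p for p in Path("bent_corpus").glob("*.wav") if worth_keeping(p)]
```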


Thank you very much to both for your answers. I took some time to try all these ideas and I think there was a major improvement!

Following @tremblap’s observation, I used the spectral flatness coefficient with a very high threshold (-2) to get rid of the most “noisy” samples and avoid analysing those. Also, at this pre-analysis level I filtered files out following @jamesbradbury’s advice (one question, though: what did you mean by “consistent”, and how do you measure it?).

Now I’ve got another problem. After going through large numbers of files, there are many slices which are identical or almost identical. As a next step, I would like to filter them out, i.e. keep just one version of each group of very similar slices. I guess this would help to further reduce the amount of data.
How should I tackle this?

I’ve noticed that using umap with very narrow numneighbors (2 or 3) and mindistance (0.1 or less) groups most of these identical samples into very separate groups. Without knowing much about it, it seems a good approach for finding these groups and filtering them out, keeping just one sample of each. But I don’t know how to implement this. I would like to have it done automatically, instead of manually going over the groups in the plot. Where is the information about the distance of every point encoded? In the (normalized) umap dataset or in the kmeans algorithm?

Again, thank you very much! I’m super excited about the results so far, for someone like me who knows very little about this great world of audio processing :grinning:


it depends how many points you have in the whole dataset… I presume there is a way to read the KDTree in a methodical manner to do that, but I wouldn’t know how. If @weefuzzy is around, he has picked some fun fights with kdtree so might be able to give us an insight - because that is what a kdtree does: organise material by proximity. I know @jamesbradbury also had fun with that so he might have a hunch too.
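outside of Max, though, a very rough sketch of the idea in Python (scipy’s cKDTree, untested, assuming you can export the normalised dataset as an array of rows plus a matching list of identifiers) could be something like:

```python
# Near-duplicate removal: keep one slice per tight neighbourhood.
# points: (n_slices, n_dims) normalised descriptor stats (or UMAP coords),
# ids: the matching "path position" identifiers. The radius is a guess to tune.
import numpy as np
from scipy.spatial import cKDTree

def deduplicate(points: np.ndarray, ids: list, radius: float = 0.05) -> list:
    tree = cKDTree(points)
    keep, discarded = [], set()
    for i in range(len(points)):
        if i in discarded:
            continue
        keep.append(ids[i])
        # everything within `radius` of this point counts as a duplicate
        for j in tree.query_ball_point(points[i], r=radius):
            if j != i:
                discarded.add(j)
    return keep
```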