Applying NMF to separate overlapping voices

Rather than mocking up a file to test this, I just chucked this audio file into the fluid.bufnmf~ help file. Given how different the voices are in character, dynamics, and pitch register, you would (perhaps) expect some separation that is somewhat content aware, but what I get instead is five different streams which all more or less contain the same sonic objects, just split up (somewhat) spectrally.

http://rodrigoconstanzo.com/temp/zornVoice.zip

<tl;dr> NMF isn’t at all aware of its content, so you will come across oddities like this, where the decomposition doesn’t match our perceptual expectations. Sometimes the answer can be to constrain the number of things it tries to decompose to (e.g. try the voice with rank = 2: closer, but still a lot of interference), or to go the other way and select what you know to be far too many, and then try and work out how they group afterwards. I think we can produce some stuff to help with the latter tactic reasonably quickly. A third possibility is to steer the algorithm by seeding its filters and / or envelopes.

NMF tries to solve a problem that has, in general, multiple solutions: what combination of spectra and envelopes could describe this particular spectrogram? The default behaviour is to start with random data and iteratively move it in a direction that reduces the error between the guess and the original. However, because there will be many equally plausible solutions, there is no guarantee that what it converges on will sound plausible to us. Sometimes, as you’ve found, it just seems to behave like a slightly weird filter bank.
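
To make that concrete, here is a minimal sketch (plain Python/numpy, not the actual fluid.bufnmf~ code) of the standard multiplicative-update recipe: start the filters W and activations H from random values and nudge them so the reconstruction error never increases. Different random seeds can land on different, equally low-error answers.

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0):
    """Toy NMF of a magnitude spectrogram V (bins x frames)."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, rank)) + 1e-9    # spectra ("filters")
    H = rng.random((rank, n_frames)) + 1e-9  # envelopes ("activations")
    for _ in range(iters):
        # Lee & Seung multiplicative updates for the Euclidean cost:
        # the error never goes up, but where it ends up depends on the start.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Two runs with different seeds can converge on different, equally
# low-error factorisations, neither of which need match perception.
```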

In NMF’s basic form (what we have here), the algorithm has no sense of many things that we might take for granted, like temporal continuity. Whilst it can resolve overlapping things, it might well require some strong hints about what the things are. One possibility is to seed it with some starting filters / envelopes in the hope that this starts it off in a place more likely to converge on what seems sensible. A tactic for this could be to perform NMF with rank = 1 on some isolated passages that seem to capture the general sources you’re after, and combine the filters from these to pass into a run on the full audio. I’ll try and knock up an example later to test this notion.
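
As a rough illustration of that seeding tactic (again a toy numpy sketch, not the object itself, and the variable names are hypothetical): run rank = 1 on an isolated passage per voice, stack the resulting filters, and use them as the starting W for a run on the full mixture.

```python
import numpy as np

def seeded_nmf(V, W_seed, iters=200, seed=0):
    """Like the toy nmf() above, but starting from supplied filters."""
    rng = np.random.default_rng(seed)
    rank = W_seed.shape[1]
    W = W_seed.copy()                          # start from the seeds
    H = rng.random((rank, V.shape[1])) + 1e-9
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)  # the seeds can still drift
    return W, H

# Hypothetical usage, with V_* as magnitude spectrograms:
# w_voice1, _ = nmf(V_isolated_voice1, rank=1)   # one filter per isolated passage
# w_voice2, _ = nmf(V_isolated_voice2, rank=1)
# W_seed = np.hstack([w_voice1, w_voice2])
# W, H = seeded_nmf(V_full_mix, W_seed)
```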


Yeah that makes total sense, and obviously the algorithm wouldn’t know what I wanted as separate.

I wonder if it would be possible (technically) to have a UI-ish thing where you can roughly define areas (à la Frederic’s last project), and that could then generate an approximate/appropriate query for descriptors, which would then feed what and how to recombine things.

I would class that as a pie-in-the-sky, non-functional(ish) idea, but it would be a musically useful way to intervene in the algorithm as a human/musician.

It would be totally possible for NMF. For given settings the sizes of the seeding filters / activations need to be exactly right, but that would be quite simple to embed in a gui-ish thing, I should think.
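
For anyone trying this by hand, a back-of-envelope check of those sizes might look something like this (generic STFT bookkeeping only; the exact buffer layout fluid.bufnmf~ expects is in its reference):

```python
import math

def expected_sizes(n_samples, fft_size=1024, hop_size=512, rank=2):
    """Rough sizes for seed material: bins per filter, frames per envelope."""
    n_bins = fft_size // 2 + 1                        # bins in each seed filter
    n_frames = math.floor(n_samples / hop_size) + 1   # analysis hops (approx.)
    return {"filters": (n_bins, rank), "envelopes": (rank, n_frames)}

print(expected_sizes(n_samples=44100 * 10))  # e.g. 10 s at 44.1 kHz
```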

The gui idea is linked to the discussion with descriptors… Have you tried overdividing, then checking different timbral descriptors on the different reconstructed ranks, like in my guitar pick demo (3rd tab of bufnmf~)? Looking at clustering similar ranks is something we have started on, and it should eventually give birth to some objects, but in the meantime feel free to experiment and tell us what descriptors make more sense to you in this quest.
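
A rough sketch of that over-divide-then-group workflow, assuming you already have the bases W and activations H from an over-decomposed run (Python rather than Max, and the descriptor and clustering choices here are just placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_centroid(mag_spec, sr):
    """Centroid of a magnitude spectrogram (bins x frames), in Hz."""
    freqs = np.linspace(0, sr / 2, mag_spec.shape[0])
    return float(np.sum(freqs[:, None] * mag_spec) / (np.sum(mag_spec) + 1e-9))

def group_ranks(W, H, sr=44100, n_groups=2):
    # One feature per rank: centroid of its reconstructed magnitude spectrogram.
    feats = np.array([[spectral_centroid(np.outer(W[:, k], H[k]), sr)]
                      for k in range(W.shape[1])])
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=1).fit_predict(feats)
    return labels  # ranks sharing a label can be summed back into one layer
```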
