Question: voice source separation

Hi guys!
What would be the best strategy to separate a soprano from an orchestra (mixed together in a sample)?

As a utopian goal, I would like to end up with one buffer containing the whole voice and another containing the whole orchestra. I know that reality is harder than this :slight_smile: And I know that neural networks are better suited to this task than NMF, but fluid.bufnmf~ is so much handier :wink:
I’ve tried it with different parameterizations (and ranks), but I haven’t been able to isolate notes reasonably well (the vibrato doesn’t help; the sample is also time-stretched, but that doesn’t seem to make a difference). I think I can more or less simulate what I want by splicing together different portions of the processed buffer, but I was wondering if there was a better strategy.

Here’s a portion of the source sample

I’m starting from the “basic example” tab in the help file, modifying the rank (from 2 to 10), the iterations (increased to 150), and the analysis parameters (I tried to increase them, unsuccessfully).

If anyone has a hint to make it better, that would be cool!
In the meantime, I’ll still keep trying to tweak parameters :slight_smile:


Hi @danieleghisi, good to hear from you!

This may well be beyond the talents of NMF, especially if you don’t have some isolated samples to make templates from. Basic NMF like this can’t group different pitches from the same source, so you’d want as many components (set with rank) as each source has discrete pitches (and, yes, vibrato complicates this).
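To make the components/rank point concrete, here is a minimal NMF sketch in Python/numpy (not FluCoMa code, just the classic multiplicative-update scheme from the same family of algorithms). Each column of W is one *fixed* spectral template, so a voice singing several pitches needs several components, and vibrato smears energy in ways a single fixed template can't follow:

```python
import numpy as np

def nmf(V, rank, iterations=150, seed=0):
    """Basic NMF with Lee-Seung multiplicative updates (Euclidean cost).

    V: non-negative magnitude spectrogram, shape (bins, frames).
    Returns W (bins, rank) spectral templates and H (rank, frames) activations.
    """
    rng = np.random.default_rng(seed)
    bins, frames = V.shape
    W = rng.random((bins, rank)) + 1e-9
    H = rng.random((rank, frames)) + 1e-9
    for _ in range(iterations):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update templates
    return W, H

# Toy mixture: two fixed "spectral templates", each active in its own half.
t1 = np.array([1.0, 0.0, 0.5, 0.0])  # pretend pitch A
t2 = np.array([0.0, 1.0, 0.0, 0.5])  # pretend pitch B
V = np.outer(t1, [1, 1, 0, 0]) + np.outer(t2, [0, 0, 1, 1])
W, H = nmf(V, rank=2, iterations=200)
print("reconstruction error:",
      np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

With rank=2 this toy mixture factors cleanly; a real soprano line would need one component per distinct pitch (at least), which is exactly the limitation described above.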

Unfortunately, I can’t play your sample for some reason, so it’s hard to offer any concrete suggestion for this particular problem. When @groma has finished travelling, he might have some ideas too.

Hi @weefuzzy, thanks for your tips! I imagined this could be the case :slight_smile:
As for the soundfile, it’s weird indeed: I just copied a Dropbox link. If you right-click and copy the audio address, you can paste it somewhere else. In any case, there’s no need for you to hear it; what you say is already very clear and reasonable :wink:

thanks again,

I know you are not allergic to code, and I have been playing with this recently - perhaps it’s of interest to you.

Thanks @jamesbradbury, I didn’t know about that resource. The method is nice and clever, though unfortunately it doesn’t sound that great on my example…

I’m under the impression that the best results in vocal source separation are achieved by deep networks, but I’m not aware of one ready-to-be-used for this.

I think I’ll get by with the chunky NMF for now, or with manual masking, no big deal :slight_smile:
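Since manual masking came up, here is a sketch of the idea in numpy (not FluCoMa code, and the choice of which components count as “voice” is made by hand; the `voice_idx` values below are purely illustrative). Any NMF factorization yields a Wiener-style soft mask that can be multiplied into the mixture’s STFT before inverting:

```python
import numpy as np

def soft_masks(W, H, voice_idx):
    """Wiener-style soft masks from an NMF factorization.

    W: (bins, rank) templates, H: (rank, frames) activations.
    voice_idx: indices of components manually assigned to the voice;
    everything else is treated as orchestra.
    Returns (voice_mask, orchestra_mask), each in [0, 1] per bin/frame.
    """
    V_hat = W @ H + 1e-9                       # model of the whole mixture
    voice = W[:, voice_idx] @ H[voice_idx, :]  # voice-only part of the model
    mask_voice = voice / V_hat                 # ratio of voice to total energy
    return mask_voice, 1.0 - mask_voice

# Toy factorization with 3 components; pretend component 0 is the voice.
rng = np.random.default_rng(1)
W = rng.random((6, 3))
H = rng.random((3, 10))
mask_v, mask_o = soft_masks(W, H, voice_idx=[0])
```

The two masks sum to 1 in every time-frequency bin, so applying both to the complex STFT and resynthesizing gives two layers that add back up to the original mixture (no splicing needed).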

Again, thanks to all of you for your pointers!

The trouble there is, of course, training them, as you need both time and data in great quantities! A lot of the research-code networks can get good results, but I think (@groma, correct me) that many of the available ones don’t share their trained weights, and are often trained on vocals, guitar, bass and similar stem data.

Hello gang,
Coming late to this conversation, but to add that Probabilistic Latent Component Analysis (PLCA), especially the 2D and shift-invariant variants, is great for this.
A minimal interactive interface is up here (not by me, but based on our paper).
The hyperparameters have been tweaked for soundscape rather than music.

These models are based on Michael Casey’s Bregman toolkit.



The link is dead for me; is it possible the server went down?

Somehow I missed this thread, but certainly using neural networks would give the best results. In the context of the fluid decomposition toolbox, I guess it would be interesting to combine NMF with pitch tracking and/or onset segmentation, so you can process the notes separately, depending on the level of automation needed…
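That segment-then-separate idea can be sketched simply: detect note onsets first, then run a small-rank NMF on each inter-onset slice so the decomposition only has to explain one note at a time. Below is a crude spectral-flux onset detector in numpy, purely illustrative (FluCoMa's own onset-detection objects would do this job properly, and the threshold here is an assumption):

```python
import numpy as np

def spectral_flux_onsets(S, threshold=0.5):
    """Crude onset detection on a magnitude spectrogram S (bins, frames).

    Positive spectral flux sums the energy *increases* between consecutive
    frames; frames where it exceeds a threshold are marked as onsets.
    Each inter-onset slice could then be decomposed separately.
    """
    flux = np.maximum(np.diff(S, axis=1), 0.0).sum(axis=0)
    flux = flux / (flux.max() + 1e-9)           # normalize to [0, 1]
    onsets = np.flatnonzero(flux > threshold) + 1
    return onsets

# Toy spectrogram: the energy jumps into a new bin at frames 4 and 8,
# as if three "notes" of four frames each were played in sequence.
S = np.zeros((3, 12))
S[0, 0:4] = 1.0
S[1, 4:8] = 1.0
S[2, 8:12] = 1.0
print(spectral_flux_onsets(S))  # → [4 8]
```

Slicing the buffer at those frames and decomposing each slice would let even a rank-1 or rank-2 NMF do something useful per note, at the cost of automating the segmentation step.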