Separate vowels / consonants

Hello list,

I’m working on an installation and I’m looking for a way to process an audio buffer in Pure Data so that, in a recorded voice, vowels and consonants are sent to separate outputs. To narrow down the search, I imagined this is something that could be done by separating the harmonic and noisier parts of the spectrum? Could you point me roughly to which of your documentation / objects I should have a look at?

Here is some loose data for the project:

  • solo female voice, no other sounds
  • spoken language is Portuguese (very rich in phonemes and colors), although could work with any language
  • audio is pre-recorded, could also be pre-analysed to help the processing
  • I would prefer the processing to be real-time, because it will be controlled randomly. But if it’s too taxing on the CPU, it’s conceivable to have preprocessed files ready as well.
  • CD audio quality/settings, each file around 1 minute long
  • surely it won’t be possible to make a 100% clean cut between the two types of sounds, but ideally, when playing both tracks together, they would sum back to the original file.



Hello and welcome!

There are a few objects and approaches I can think of in the current toolset, and more to come with the machine learning stuff.

  • the first approach is a classic sines-plus-residual decomposition, which can be done with the fluid.sines~ object. It estimates spectral peaks and resynthesises them, yielding a separate residual which null-sums with the sinusoidal output. The object needs an oversampled FFT to sound good, but all the FFT parameters, and all the tracking parameters, are available to you in the helpfile, with pointers to papers too.

  • a variation on that first approach would be to use fluid.hpss~, which does harmonic/percussive separation. I’ve had good results isolating the noisy part of signals like violin with it. There are various modes too, some binary and some with soft masks, with various types of artefacts. All of the outputs null-sum again (a bit of an obsession for us). Again, the FFT settings are quite important here, but they can be explored in real time, as can the filter sizes.

  • a more daring approach, which might not work for voice but is worth trying, is in the example folder, where NMF (non-negative matrix factorisation) is used to split the sound into archetypes (bases), which are then classified. I used it on guitar to isolate the pick component quite successfully. Maybe you can train it on a 10-second section of a given voice, isolate the noisy bases, and use fluid.nmffilter~ to split the various elements in real time.

  • another approach is more garage and dirty, and a favourite of mine for effect sends, for instance: I use a pitch analyser’s pitch-confidence output to drive a cross-fader. I can keep the noisy (low-confidence) part out of a lush delay, for instance, and just send material above a certain “pitchness”.
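
For anyone curious what the harmonic/percussive trick is doing under the hood, here is a rough sketch of the median-filtering idea in Python (all names and the toy data here are mine, not the library’s): harmonics show up as horizontal ridges in a spectrogram and noisy/percussive events as vertical ones, so a median filter along time enhances the former, one along frequency enhances the latter, and soft masks built from the two always sum back to the input.

```python
import statistics

def median_filter(seq, size=3):
    """Running median over a window, clamped at the edges."""
    half = size // 2
    return [statistics.median(seq[max(0, i - half):i + half + 1])
            for i in range(len(seq))]

def hpss_masks(S, h_size=3, p_size=3):
    """Soft harmonic/percussive masks for a magnitude spectrogram S[frame][bin]."""
    n_frames, n_bins = len(S), len(S[0])
    # harmonic enhancement: median along time, one bin trajectory at a time
    H = [[0.0] * n_bins for _ in range(n_frames)]
    for b in range(n_bins):
        filtered = median_filter([S[t][b] for t in range(n_frames)], h_size)
        for t in range(n_frames):
            H[t][b] = filtered[t]
    # percussive enhancement: median along frequency, one frame at a time
    P = [median_filter(frame, p_size) for frame in S]
    # soft masks that always sum to 1, so the two outputs null-sum
    Mh = [[H[t][b] / (H[t][b] + P[t][b] + 1e-12) for b in range(n_bins)]
          for t in range(n_frames)]
    Mp = [[1.0 - Mh[t][b] for b in range(n_bins)] for t in range(n_frames)]
    return Mh, Mp

# toy spectrogram: a steady "vowel" tone in bin 1, plus one broadband "consonant" frame
S = [[0, 1, 0, 0],
     [0, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 1, 0, 0],
     [0, 1, 0, 0]]
Mh, Mp = hpss_masks(S)
```

On this toy input Mh lands near 1 on the steady tone and Mp near 1 on the broadband frame; the real object applies such masks to the complex STFT before resynthesis, with the FFT size and filter sizes as the parameters mentioned above.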
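
And if you want to peek at the principle behind the NMF approach without the library, here is a minimal Python sketch using the classic Lee–Seung multiplicative updates (Euclidean cost) on a made-up magnitude spectrogram. It is only an illustration of the factorisation idea, not FluCoMa code; the function names and toy data are mine.

```python
import random

def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, rank, iters=300, seed=1):
    """Factor nonnegative V (bins x frames) as W @ H with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    eps = 1e-9
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H update: H *= (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[r][j] * num[r][j] / (den[r][j] + eps) for j in range(m)]
             for r in range(rank)]
        # W update: W *= (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][r] * num[i][r] / (den[i][r] + eps) for r in range(rank)]
             for i in range(n)]
    return W, H

# toy spectrogram (bins x frames): a steady tone on bin 1 plus a broadband click in frame 2
V = [[0, 0, 1, 0, 0],
     [1, 1, 2, 1, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 1, 0, 0]]
W, H = nmf(V, rank=2)  # columns of W are spectral "bases", rows of H their activations
```

The columns of W are the archetypal spectra and the rows of H say when each one is active; classifying the bases (tonal vs noisy) and resynthesising per group is the splitting step the example folder demonstrates.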
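
The confidence-crossfader recipe is simple enough to sketch numerically. Here is a hypothetical Python version of the control law (in Pd this would be the pitch tracker’s confidence outlet driving two gains; the function name and parameters are mine):

```python
import math

def confidence_crossfade(sample, confidence, threshold=0.5, width=0.2):
    """Equal-power crossfade between a 'pitched' and a 'noisy' bus.

    confidence: 0..1 from a pitch tracker; the fade happens over
    [threshold - width/2, threshold + width/2].
    """
    lo, hi = threshold - width / 2, threshold + width / 2
    # position inside the fade zone, clamped to 0..1
    x = min(1.0, max(0.0, (confidence - lo) / (hi - lo)))
    pitched_gain = math.sin(x * math.pi / 2)  # equal-power curves:
    noisy_gain = math.cos(x * math.pi / 2)    # gains sum to 1 in power
    return sample * pitched_gain, sample * noisy_gain

pitched, noisy = confidence_crossfade(1.0, 0.9)  # confident frame: mostly pitched bus
```

The width parameter gives you a soft zone instead of a hard gate, which tends to sound less clicky on speech.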

I hope all of this helps. There are examples of all of these in the helpfiles of the four objects mentioned, as well as in the Example folder.

Happy coding!


If I understand what you’re trying to do, I think this may be one of the best options, as well as one of the easiest.

It’s been my experience that many of the “decomposition” algorithms, unless perfectly/painstakingly tuned, will tend to give you variations on your input, so you may end up with a stream of stuff that is mostly vowels, and another that is mostly consonants, but neither being exclusively so.

By gating/panning via confidence, or other descriptors (spectral flatness may be good too), you will get clearly separated sounds. That may not be accurate as speech is complex, but it will definitely give you “two things” that you can then pan around.
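
On the spectral-flatness suggestion: the descriptor itself is just the geometric mean over the arithmetic mean of the magnitude spectrum, sitting near 1 for noise-like (consonant-ish) frames and near 0 for tonal (vowel-ish) ones, so it can drive the same gate/pan logic. A small Python sketch, with names that are mine rather than any library’s:

```python
import math

def spectral_flatness(mags):
    """Flatness of one magnitude spectrum, in 0..1 (geometric / arithmetic mean)."""
    eps = 1e-12
    log_mean = sum(math.log(m + eps) for m in mags) / len(mags)
    geo_mean = math.exp(log_mean)        # geometric mean via log domain
    arith_mean = sum(mags) / len(mags) + eps
    return geo_mean / arith_mean

flat_noise = spectral_flatness([1.0, 1.0, 1.0, 1.0])  # flat spectrum: near 1
flat_tone = spectral_flatness([0.0, 1.0, 0.0, 0.0])   # single peak: near 0
```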

Indeed, but with the programmatic approach one could use, for instance, a running average of the centroid of the noise output to tweak the thresholds of HPSS, or other kinds of creative-coding feedback principles. The sky is the limit! Actually, the number of hours per day is my biggest limit right now 🙂
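
For what it’s worth, the “running average of the centroid” part of that feedback loop is just an exponential moving average. A minimal sketch in Python, assuming the centroid arrives as a list of per-frame values (the function name is mine):

```python
def ema(values, alpha=0.2):
    """Exponential moving average; higher alpha tracks the input faster."""
    avg = values[0]
    out = [avg]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
        out.append(avg)
    return out

# a centroid spike (e.g. one loud consonant) gets damped before it
# reaches whatever threshold it is modulating
smoothed = ema([200.0, 200.0, 4000.0, 200.0, 200.0])
```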

Thanks. I finally had a try, and indeed fluid.bufpitch could be a first step, but more steps are necessary.
As suggested, my first approach will be through gating/panning, to avoid any synthesis. I’ll see whether adding analysis layers could help separate the audio a bit more.


Hi Jmmmp, out of curiosity, what operating system and platform are you using?

I ask because I am having some issues on a new mac running Pd, whereas I was able to get the objects working with an older Windows machine. I wondered if it was a problem for others or perhaps a configuration issue on my side.

Also, unrelated: are you the same Jmmmp responsible for the classic and very amazing Pd library of the same name? 😄
Just curious.



I use Windows 10. There have been some issues with Pd on the newest macOS, as Pd isn’t from a “trusted developer”, i.e. one willing to pay Apple US$100 a year for permission to run on its magnificent system. The list might have more details on it, in case that is the issue.

I’m that jmmmp, and the only one I know of. Also responsible for the even more amazing but less classic Click Tracker by João Pais.
