Voice processing

GabeChan · October 15, 2021, 8:25am

hey guys

I’m trying to draft a patch that will give me a closest match on slices of a spoken voice sample from a corpus.

I am using only MFCCs at the moment but I’d like to add some more dimensions … maybe some spectral moments (which ones…?)

Does anyone have experience with this? What are good descriptors to use in practice?

any pointers are much appreciated!

best
Gabriel

jamesbradbury · October 15, 2021, 10:48am

Hey Gabe,

Nice to hear you’re back on the matching problem! Let me propose some things to you:

Have you tested with your ears to evaluate the matching process?
What about the current matching doesn’t align with how you think it should behave? Is it the envelopes (morphology), tonal characteristics, a specific behaviour in a frequency band? All these questions are open-ended and up for grabs, but if MFCC isn’t working it might be that you’re listening for something specific rather than something generic (which is kinda what MFCC ends up being).
What kind of data processing are you doing for the matching, i.e. normalisation, standardisation, dimension reduction? These might improve your matching or potentially worsen it!
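
To make the data-processing question concrete, here is a minimal sketch of standardising descriptors before a nearest-neighbour match. It is not taken from anyone's patch in this thread: the function names, the three example descriptors and their values are all hypothetical, and plain Euclidean distance is just one reasonable choice.

```python
import math

def standardise(rows):
    """Z-score each column so no single descriptor dominates the distance."""
    n_cols = len(rows[0])
    means = [sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    stds = []
    for c in range(n_cols):
        var = sum((r[c] - means[c]) ** 2 for r in rows) / len(rows)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns
    scaled = [[(r[c] - means[c]) / stds[c] for c in range(n_cols)] for r in rows]
    return scaled, means, stds

def nearest(target, corpus):
    """Return the index of the corpus row closest to target (Euclidean)."""
    scaled, means, stds = standardise(corpus)
    # Scale the target with the corpus statistics, not its own.
    t = [(target[c] - means[c]) / stds[c] for c in range(len(target))]
    dists = [math.dist(t, row) for row in scaled]
    return min(range(len(dists)), key=dists.__getitem__)

# Hypothetical per-slice descriptors: [mean MFCC 1, mean MFCC 2, loudness in dB]
corpus = [
    [12.0, -3.0, -20.0],
    [10.5, -2.5, -45.0],
    [-4.0, 6.0, -22.0],
]
target = [11.0, -2.8, -24.0]
best = nearest(target, corpus)
```

Without the scaling step, a descriptor with a large numeric range (loudness in dB here) can swamp the others, which is one way that adding dimensions ends up making a match worse rather than better.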

I’ll remind you of some tricks to code by:

Know your data
Use your ears
Sometimes less is more for the computer

If you have some source sounds and/or patches as well, that’d be great. Maybe we could hack something together asynchronously or throw some ideas back and forth!

GabeChan · October 15, 2021, 3:34pm

Hi James!
yes, I’m back at it!
thanks so much for getting back so quickly and in such depth, I find your suggestions very useful as a way to think about the problem.

I still remember us talking about the same problem in the workshop, so my question was more of a general one: I wanted to know which directions I could investigate. For example, if someone said ‘try spectral skewness’, I’d have a starting point. At the moment it’s a lot of stuff at the same time, and all the coding and chopping and parameters that feed back into each other are a little overwhelming. Hence I thought I’d post the question here and see what comes back.

let’s do it like that, I’ll try to post something within the week!

jamesbradbury · October 15, 2021, 4:14pm

I would suggest trying something and not worrying too much about it for now. Starting with MFCC is a good bet for voice sounds that are not spoken in a tonal language. If you find it’s not doing what you need, it’s a place from which to explore different descriptors, statistical curating and hybridising things. Let me know how you go.
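
For anyone wanting a starting point on the spectral moments mentioned earlier in the thread, here is a rough sketch of the first four (centroid, spread, skewness, kurtosis) computed from a single magnitude spectrum. The function name and the assumption that bins run linearly from 0 Hz to Nyquist are mine, not from any particular library, and real tools may differ in weighting or frequency scale.

```python
import math

def spectral_moments(magnitudes, sample_rate):
    """Centroid, spread, skewness and kurtosis of one magnitude spectrum."""
    n_bins = len(magnitudes)
    # Assumption: bins are evenly spaced from 0 Hz up to Nyquist inclusive.
    freqs = [i * (sample_rate / 2) / (n_bins - 1) for i in range(n_bins)]
    total = sum(magnitudes)
    if total == 0:
        return 0.0, 0.0, 0.0, 0.0  # silent frame, nothing to describe
    # Treat the normalised spectrum as a probability distribution over frequency.
    weights = [m / total for m in magnitudes]
    centroid = sum(f * w for f, w in zip(freqs, weights))
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs, weights))
    spread = math.sqrt(variance)
    if spread == 0:
        return centroid, 0.0, 0.0, 0.0  # all energy in a single bin
    skewness = sum(((f - centroid) / spread) ** 3 * w for f, w in zip(freqs, weights))
    kurtosis = sum(((f - centroid) / spread) ** 4 * w for f, w in zip(freqs, weights))
    return centroid, spread, skewness, kurtosis

# Toy symmetric spectrum, 5 bins at a hypothetical 8 kHz sample rate.
centroid, spread, skewness, kurtosis = spectral_moments([0, 1, 2, 1, 0], 8000)
```

Roughly speaking, centroid tracks brightness, spread tracks bandwidth, and skewness and kurtosis describe how lopsided and how peaky the spectrum is. Whether any of them helps a voice match is something to check by ear, per the advice above.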