I’m trying to draft a patch that, for each slice of a spoken-voice sample, finds the closest match in a corpus.
I’m only using MFCCs at the moment, but I’d like to add more dimensions — maybe some spectral moments (which ones?).
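To make the idea concrete, here’s a rough numpy-only sketch of the kind of thing I mean: the first four spectral moments (centroid, spread, skewness, kurtosis) per slice, then a nearest-neighbour lookup over z-scored feature vectors. The function names and the choice of moments are just placeholders, not a working setup:

```python
import numpy as np

def spectral_moments(frame, sr):
    """First four spectral moments of one frame:
    centroid, spread (std), skewness, kurtosis."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spec / (spec.sum() + 1e-12)  # treat the magnitude spectrum as a distribution
    centroid = (freqs * p).sum()
    spread = np.sqrt(((freqs - centroid) ** 2 * p).sum())
    skew = ((freqs - centroid) ** 3 * p).sum() / (spread ** 3 + 1e-12)
    kurt = ((freqs - centroid) ** 4 * p).sum() / (spread ** 4 + 1e-12)
    return np.array([centroid, spread, skew, kurt])

def closest_match(query_feat, corpus_feats):
    """Index of the corpus slice whose z-scored feature vector is nearest
    (z-scoring so the large-valued centroid doesn't dominate the distance)."""
    mu = corpus_feats.mean(axis=0)
    sigma = corpus_feats.std(axis=0) + 1e-12
    dists = np.linalg.norm((corpus_feats - mu) / sigma
                           - (query_feat - mu) / sigma, axis=1)
    return int(np.argmin(dists))
```

In a real patch the MFCC vector would be concatenated onto these moments before the distance is taken; I’ve left that out to keep the sketch short.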
Does anyone have experience with this? What descriptors work well in practice?
Any pointers are much appreciated!