Tracking 't', 'd', 'b', 'g' (etc.) sounds

I’m building a patcher that tracks speech and voice sounds (spoken or sung) and translates them into the movement of shapes in a video.

Think of it as an audio visualiser, but focused on the human voice.
Vowels are meant to move one part of the video, and consonants another…
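One possible way to decide which part of the video a given frame should drive is a noisiness descriptor such as spectral flatness, the geometric mean of the power spectrum divided by its arithmetic mean. Harmonic, vowel-like frames score near 0 and noisy frames score much higher. Spectral flatness is only one possible choice, and the function names and the 0.3 threshold below are hypothetical. This is a minimal pure-Python sketch with a naive DFT. It also has a clear limit: voiced plosives such as 'b' or 'd' may not look noisy at all.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 of a Hann-windowed frame (slow, illustrative only)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n)) for t, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum, between 0 and 1."""
    power = [p + eps for p in power_spectrum(frame)]  # eps avoids log(0) on empty bins
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    return geometric / (sum(power) / len(power))

def route(frame, threshold=0.3):
    """Hypothetical rule: noisy frames drive the 'consonant' shape, the rest the 'vowel' shape."""
    return "consonant" if spectral_flatness(frame) > threshold else "vowel"

sr, n = 8000, 256
vowel_like = [math.sin(2 * math.pi * 250 * t / sr) for t in range(n)]  # pure tone stand-in
rng = random.Random(1)
noise_like = [rng.uniform(-1.0, 1.0) for _ in range(n)]                 # hiss stand-in
print(route(vowel_like), route(noise_like))
```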

The vowel part reads very well, and I’m managing to catch the noisiest, most sustained consonants, 's' and 'f' (by tracking the spectral centroid). With other consonants, especially the shorter ones, it doesn’t always work.
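For reference, the spectral centroid is the magnitude-weighted mean frequency of a frame. That is why it separates hissy 's'/'f' frames, whose energy sits high up, from vowels, whose energy sits lower. Here is a minimal pure-Python sketch of one frame's centroid with a naive DFT. The 8 kHz rate, the 256-sample frame, and the test signals are assumptions for illustration. A patch would get the same value from its own analysis object rather than from code like this.

```python
import cmath
import math
import random

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but fine for one short frame)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n)) for t, x in enumerate(frame)]
    mags = magnitude_spectrum(windowed)
    total = sum(mags)
    if total == 0.0:
        return 0.0  # silent frame: the centroid is undefined, so report 0
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

sr, n = 8000, 256
tone = [math.sin(2 * math.pi * 250 * t / sr) for t in range(n)]  # vowel-ish: energy low down
rng = random.Random(0)
hiss = [rng.uniform(-1.0, 1.0) for _ in range(n)]                 # 's'-ish: energy spread up high
print(round(spectral_centroid(tone, sr)), round(spectral_centroid(hiss, sr)))
```

A short plosive contributes only a few frames of high centroid at most. If those frames are averaged or smoothed together with the vowel around them, the jump can disappear.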

Can anyone suggest other descriptors, or combinations of them, that would help me track all the consonants?

Thanks!

Hello!

This looks interesting. It might be that your spectralshape’s window is too large. What are your FFT settings?
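To see why window size matters, here is a minimal sketch. It assumes 44.1 kHz, a hop of half the window (a common default, but check your own settings), and a hypothetical 5 ms noise burst as a crude stand-in for a plosive release. It measures how far the loudest frame stands out from a typical frame. With a long window the burst is averaged with everything else in that window, so it stands out much less.

```python
import math
import random

SR = 44100  # assumed sample rate

def test_signal(seed=0):
    """0.5 s of a quiet 220 Hz tone with a 5 ms noise burst at 0.25 s (hypothetical plosive stand-in)."""
    rng = random.Random(seed)
    n = SR // 2
    sig = [0.1 * math.sin(2 * math.pi * 220 * t / SR) for t in range(n)]
    start, length = n // 2, int(0.005 * SR)
    for t in range(start, start + length):
        sig[t] += rng.uniform(-1.0, 1.0)
    return sig

def frame_energies(sig, window, hop):
    """Mean squared amplitude of each analysis frame."""
    return [sum(x * x for x in sig[i:i + window]) / window
            for i in range(0, len(sig) - window + 1, hop)]

def burst_contrast(sig, window):
    """Loudest frame energy divided by the median frame energy (higher = burst stands out more)."""
    energies = sorted(frame_energies(sig, window, window // 2))
    return energies[-1] / energies[len(energies) // 2]

sig = test_signal()
for window in (2048, 1024, 256):
    ms = 1000 * window / SR  # window length in milliseconds
    print(f"{window:5d} samples = {ms:5.1f} ms, burst contrast {burst_contrast(sig, window):6.1f}")
```

At 44.1 kHz a 2048-sample window spans about 46 ms and a 256-sample window about 6 ms. A short event has far more influence on the smaller frame.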

Another idea could be to use onset detection. 't', 'd', 'b' and 'g' are all plosives, so you could get lucky there too. Again, a short window will help catch the details.
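A common onset detection function is spectral flux: the sum of bin-by-bin magnitude increases from one frame to the next. A plosive's sudden broadband burst makes it spike. The following is a minimal offline sketch in pure Python. The 8 kHz rate, the 128/64 window and hop, the synthetic test burst, and the "half of the maximum" threshold are all assumptions. In a real-time patch you would set a fixed threshold instead.

```python
import math
import random

def magnitudes(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, illustrative only)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectral_flux(sig, window=128, hop=64):
    """Sum of positive magnitude increases between consecutive Hann-windowed frames."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / window) for t in range(window)]
    flux, prev = [], None
    for start in range(0, len(sig) - window + 1, hop):
        mags = magnitudes([sig[start + t] * hann[t] for t in range(window)])
        # Half-wave rectify: only energy that appears counts, energy that decays is ignored.
        flux.append(0.0 if prev is None else sum(max(m - p, 0.0) for m, p in zip(mags, prev)))
        prev = mags
    return flux

def pick_onsets(flux, hop, sample_rate, threshold):
    """Start times (s) of frames whose flux is above threshold and is a local peak."""
    return [i * hop / sample_rate
            for i in range(1, len(flux) - 1)
            if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]]

# Test signal: a quiet 200 Hz tone with a 5 ms noise burst at 0.25 s (a crude stand-in for a 't').
sr = 8000
rng = random.Random(0)
sig = [0.1 * math.sin(2 * math.pi * 200 * t / sr) for t in range(sr // 2)]
for t in range(2000, 2040):
    sig[t] += rng.uniform(-1.0, 1.0)

flux = spectral_flux(sig)
print(pick_onsets(flux, hop=64, sample_rate=sr, threshold=0.5 * max(flux)))
```

Because flux reacts to change rather than to the absolute spectrum, it can flag a short plosive that a smoothed centroid misses. Voiced plosives with weak bursts may still need a lower threshold.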