Perhaps not surprisingly, this “envelope extraction” works better on longer stretches of audio.
Here are 100ms windows (my longer “predicted” bit of analysis). At the moment I’m using the same analysis/fft settings (256 64 512), but with a larger overall analysis window (4410 samples).
In this context, I think the mean-per-frame is better than the peak-per-frame. With so many frames, the overall gesture/contour is the same either way, with the peak version just being a bit steppier.
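To make the mean-vs-peak comparison concrete, here is a rough Python/NumPy sketch rather than the actual patch. It assumes "mean-per-frame" and "peak-per-frame" mean the average and the maximum of the absolute sample values in each frame, that the settings are window 256 / hop 64, and that the sample rate is 44.1 kHz. The test signal is a made-up decaying sine, and there is no padding at the edges, so the frame count may differ from other tools.

```python
import numpy as np

def frame_signal(x, win=256, hop=64):
    """Slice x into overlapping frames (no padding at the edges)."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def envelopes(x, win=256, hop=64):
    """Return (mean-per-frame, peak-per-frame) of the rectified signal."""
    frames = np.abs(frame_signal(x, win, hop))
    return frames.mean(axis=1), frames.max(axis=1)

# Hypothetical test signal: 4410 samples (100 ms at an assumed 44.1 kHz)
# of a decaying 440 Hz sine, standing in for a single "attack".
sr = 44100
t = np.arange(4410) / sr
x = np.exp(-t * 40.0) * np.sin(2 * np.pi * 440.0 * t)

mean_env, peak_env = envelopes(x)
print(len(mean_env), "frames")
```

Plotting `mean_env` against `peak_env` should show the same overall contour. The peak version tends to hold a value across several overlapping frames, which is one plausible source of the stepped look.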
What’s also somewhat surprising is how the centroid behaves over time. I expected a bit more of a rolloff or some kind of change over that analysis window, but it appears to be largely flat. I checked other spectral descriptors as well (not massively thoroughly), but most show similar (lack of) contours. I guess that’s the nature of this being a single “attack” or “moment”, as opposed to some kind of morphology that changes centroid/timbre over time in a more overt way.
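For checking how flat the centroid really is, a per-frame spectral centroid can be sketched as below. This is a generic magnitude-weighted centroid in Python/NumPy, not necessarily what the descriptor used above computes. The Hann window, the 44.1 kHz sample rate, and the reading of (256 64 512) as window / hop / FFT size are all assumptions, and the sine input is only a placeholder.

```python
import numpy as np

def spectral_centroid(x, sr=44100, win=256, hop=64, n_fft=512):
    """Magnitude-weighted mean frequency (Hz) of each frame."""
    n_frames = 1 + (len(x) - win) // hop
    window = np.hanning(win)                    # assumed window shape
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # bin centre frequencies
    out = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + win] * window
        mag = np.abs(np.fft.rfft(frame, n=n_fft))  # zero-padded to n_fft
        total = mag.sum()
        out[i] = (freqs * mag).sum() / total if total > 0 else 0.0
    return out

# Placeholder input: a steady 1 kHz sine over a 4410-sample window.
sr = 44100
t = np.arange(4410) / sr
x = np.sin(2 * np.pi * 1000.0 * t)

centroid = spectral_centroid(x, sr=sr)
# Slope of a straight-line fit, in Hz per frame; near zero means "flat".
slope = np.polyfit(np.arange(len(centroid)), centroid, 1)[0]
```

Running this on a real attack and looking at `slope` (or just plotting `centroid`) gives a quick number for whether there is any rolloff over the window.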
There’s also more meat on these bones, which could serve well for the kind of contour/envelope analysis discussed in this thread.