What’s surprising is that the highest output (here) is in the middle of the overall analysis window, whereas the sharpest spike happens at the start.
Actually, looking at the tiny fragment of audio (256 samples), it doesn’t show much of a morphology:
So this is, I guess, the tiny start of what I drew out. And the contour I’m seeing is probably “correct” in terms of perceptual expectation (like a 2ms “onset”).
I could go for a time-domain thing for loudness, but the benefit of this approach is having it time-aligned with the other descriptors for the same window of time. So at x point in time, the loudness/centroid/etc… were [list of values]. I could S&H a time-based envelope follower, but that’d still leave me in a similar boat for spectral analysis. Though folding may be useful there.
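To make the "time-aligned" idea concrete, here is a rough Python sketch (not the actual Max patch) that produces one row of descriptors per analysis frame, so loudness and centroid share the same time stamp. Everything in it is an assumption for illustration: the function name `frame_descriptors`, the 44.1 kHz sample rate, the Hann window, and plain RMS in dB standing in for a proper loudness measure. It reads the settings as window 256, hop 64, FFT 512.

```python
import cmath
import math


def frame_descriptors(samples, sample_rate=44100, win=256, hop=64, fft=512):
    """Return one (time_s, loudness_db, centroid_hz) tuple per analysis frame.

    Hypothetical helper: RMS in dB is only a stand-in for a real loudness model.
    """
    # Hann window; the frame is implicitly zero-padded from `win` up to `fft`.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    rows = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + win], window)]

        # Loudness proxy: RMS of the windowed frame, in dB.
        rms = math.sqrt(sum(x * x for x in frame) / win)
        loudness_db = 20 * math.log10(max(rms, 1e-12))

        # Naive DFT magnitudes for bins 0..fft/2 (slow, but dependency-free).
        mags = []
        for k in range(fft // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / fft)
                      for n, x in enumerate(frame))
            mags.append(abs(acc))

        # Spectral centroid: magnitude-weighted mean bin, converted to Hz.
        total = sum(mags)
        if total > 0:
            centroid_bin = sum(k * m for k, m in enumerate(mags)) / total
            centroid_hz = centroid_bin * sample_rate / fft
        else:
            centroid_hz = 0.0

        # Both descriptors share the same frame start time.
        rows.append((start / sample_rate, loudness_db, centroid_hz))
    return rows
```

Each row is the "[list of values]" for one point in time. A sample-and-held envelope follower could fill the loudness slot, but the centroid would still need a frame-based analysis like this.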
I haven’t yet tried a smaller hop size (I’m presently using @fftsettings 256 64 512), but I feel like I’m near the squeezing-water-from-a-rock point with such a tiny analysis window.
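For a sense of what a smaller hop would actually buy, here is a quick bit of frame-count arithmetic in Python. The helper `frame_count` and the padding values are assumptions on my part (I have not checked how the analysis pads the 256-sample fragment), so treat the numbers as illustrative only.

```python
def frame_count(num_samples, win=256, hop=64, pad=0):
    """Number of full analysis frames when `pad` zeros are added to each end.

    Hypothetical helper; the real padding behaviour is an assumption here.
    """
    padded = num_samples + 2 * pad
    if padded < win:
        return 0
    return (padded - win) // hop + 1


for hop in (64, 32, 16):
    unpadded = frame_count(256, hop=hop)
    half_window_pad = frame_count(256, hop=hop, pad=128)
    print(hop, unpadded, half_window_pad)
```

With no padding, a 256-sample fragment only ever fits one 256-sample window, whatever the hop. With padding, a smaller hop yields more frames, but each frame still spans the same 256 samples, so the extra frames overlap more heavily rather than adding much new information.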
I can, perhaps, push this kind of “envelope extraction” onto the predicted analysis stuff, where I’ll (hopefully) have 100ms of audio to analyze. What a luxury!