Revisiting a pretty old subject here, but in searching on the forum, it doesn’t look like any of it took place on here.
Back at the second plenary there was a bunch of discussion between @tremblap, @a.harker, @weefuzzy, and myself about what to do when you’re analyzing a small number of samples and want the best representation of what’s in there.
As a point of reference, I’m talking about analyses like these:
@windowsize 256 @hopsize 64 @numframes 256
So in a situation like this, you get back only 7 analysis frames. Taking the vanilla mean from fluid.bufstats~ probably isn’t enough to remove the impact of the edge frames, particularly since the only padding option here is zero-padding.
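To put numbers on that, here is a rough Python sketch of where the frames land. It assumes the source gets zero-padded by (window - hop) samples on each side; I have inferred that from the frame counts (7 frames for 256 samples, 11 for 512) rather than confirmed it, and `frame_layout` is just a name I made up for illustration.

```python
def frame_layout(num_samples, window, hop):
    """Return (start, real_fraction) for each analysis frame.

    ASSUMPTION: the source is zero-padded by (window - hop) samples on
    each side. This is inferred from the observed frame counts, not taken
    from any documentation.
    """
    pad = window - hop
    padded_len = num_samples + 2 * pad
    num_frames = (padded_len - window) // hop + 1
    layout = []
    for i in range(num_frames):
        start = i * hop - pad          # frame start relative to the real audio
        end = start + window
        # Number of samples in this frame that are real audio, not padding.
        real = max(0, min(end, num_samples) - max(start, 0))
        layout.append((start, real / window))
    return layout


for index, (start, fraction) in enumerate(frame_layout(256, 256, 64)):
    print(f"frame {index}: starts at {start:5d}, {fraction:.0%} real audio")
```

Under that assumption, only the middle frame of the 7 is made up entirely of real samples, and the outermost frames are three-quarters zeroes.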
The intended use case here is that I’m trying to shorten the ‘onset descriptors’ analysis window down to 256 samples instead of 512, since I’m now doing some extra analysis stuff, and further post-processing. At the best of times my latency here is ca. 11ms, but now it’s getting closer to 15ms+, which starts feeling laggy.
From my vague recollection, what is “best” (generally speaking) depends on the type of descriptor in question.
For centroid, I believe mirroring (vs zero-padding) can be better because it keeps the spectral emphasis more-or-less in the same place (though that’s not possible here). For something like pitch, throwing out the outer frames is desirable(?) since you don’t know what is in there. For loudness, something similar applies, to avoid overly weighting the “zeroes” that surround the analysis window.
Is this about right?
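Just to be clear about what I mean by mirroring vs zero-padding, here is a minimal Python sketch on a toy signal. Both function names are made up for illustration, and the mirroring here is a plain reflection that doesn’t repeat the edge sample, which is only one of several possible conventions.

```python
def zero_pad(signal, pad):
    """Surround the signal with `pad` zeroes on each side."""
    return [0.0] * pad + list(signal) + [0.0] * pad


def mirror_pad(signal, pad):
    """Surround the signal with a reflection of itself (edge sample not repeated).

    Illustrative only: requires pad < len(signal).
    """
    if pad >= len(signal):
        raise ValueError("pad must be smaller than the signal length")
    left = signal[pad:0:-1]             # samples 1..pad, reversed
    right = signal[-2:-pad - 2:-1]      # last `pad` samples before the final one, reversed
    return list(left) + list(signal) + list(right)


toy = [1.0, 2.0, 3.0, 4.0]
print(zero_pad(toy, 2))
print(mirror_pad(toy, 2))
```

With mirroring, the padded region carries the same kind of content as the real audio, whereas the zeroes add nothing but an abrupt edge.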
I’m not sure why this is the case, but in my patches from around the second plenary (which I’ve just tweaked as we’ve carried on) I’ve been keeping the first 7 frames of loudness (although from 512 samples, so 7 out of 11 frames), and frames 3-8 for spectral moments (again, out of 11).
So that doesn’t really line up with my memory of stuff (hence this thread, and general inquiry). The spectral moments “middle frames” makes sense, but I’m not sure why I would only take the first frames of loudness. I guess to try to capture the initial “oomph” of the sound only? With only 256 samples it’d be all “oomph” I suppose.
Do MFCCs and mel-bands fall into the spectral descriptors thing, where it would (perhaps) be better to grab the central frames?
Another thing that occurred to me is that I may want different numbers of frames for different statistics. Maybe for the mean it would be good to be more selective about the frames so as not to overly weight it towards the zero-padding, but perhaps I do want the zero-padding-ness for the derivative, since that may capture more of what the actual morphology of the analyzed segment was?
I guess I can always ‘double dip’ the fluid.bufstats~-ing by taking some stats from the whole window, and other stats from just the middle etc…
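As a rough sketch of that ‘double dip’ idea outside of Max, here it is in Python: the mean comes from the middle frames only, and the derivative stats come from all frames. The loudness values are made up purely for illustration, the helper names are hypothetical, and this isn’t meant to mirror how fluid.bufstats~ computes things internally.

```python
from statistics import mean, stdev


def stats_for_frames(values, first=0, last=None):
    """Mean and standard deviation over frames first..last (inclusive)."""
    last = len(values) - 1 if last is None else last
    selected = values[first:last + 1]
    return mean(selected), stdev(selected)


def first_difference(values):
    """Frame-to-frame change, one value fewer than the input."""
    return [b - a for a, b in zip(values, values[1:])]


# Made-up loudness values (dB) for 7 frames, NOT real analysis output.
loudness = [-60.0, -30.0, -12.0, -10.0, -14.0, -35.0, -62.0]

# Dip 1: mean from the middle frames only, to keep the padding out of it.
middle_mean, _ = stats_for_frames(loudness, first=2, last=4)

# Dip 2: derivative stats from every frame, padding-heavy edges included.
deriv = first_difference(loudness)
deriv_mean, deriv_std = stats_for_frames(deriv)

print("middle mean:", middle_mean)
print("derivative mean / std:", deriv_mean, deriv_std)
```

In a patch this would presumably be two fluid.bufstats~ objects reading the same descriptor buffer with different frame ranges.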