Edge cases in analysis frames

You are right that it gets philosophical, and that it is only what you think is better. If someone like me thinks the other is better, the current implementation allows you to do both; the other way round wouldn’t. Ergo, a superior interface design! It is as if we thought about these 2 options… :smiley:

But your drawing in Excel might help people understand what is happening, which is a good contribution, so thanks for that!

Well, options to choose from would let you do whatever you want! At present it’s not possible (without some tedious workarounds) to do any kind of mirroring, which would be the other most common thing to do with the edge frames.

A question.

So I’m trying to do the same thing as above, but with spectral descriptors (centroid in this case), and I’m getting somewhat unexpected behavior.

It looks like if you analyze digital silence, you get Nyquist/2 as the default value. I guess it has to return something, and this is, I suppose, no more incorrect than any other value that is returned.
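Just to make the “it has to return something” point concrete to myself, here’s a minimal numpy sketch of a centroid with a silence fallback (the fallback of sr/4 and the threshold are my assumptions about the behaviour described here, not the actual internals):

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Spectral centroid of one time-domain frame. When the frame is
    (near-)silent the weighted mean becomes 0/0, so *something* has to be
    returned; here the fallback is sr/4 (half of Nyquist), mirroring the
    behaviour described above. The threshold is an arbitrary choice."""
    mags = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = mags.sum()
    if total < 1e-12:                              # digital (or near-digital) silence
        return sr / 4
    return (freqs * mags).sum() / total

sr = 44100
print(spectral_centroid(np.zeros(256), sr))        # 11025.0, i.e. the "11k" frames
print(spectral_centroid(np.random.randn(256), sr)) # an ordinary centroid value
```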

But I noticed that if I do what I’m doing above with loudness (analyzing 640 samples when I only want 256), the outer frames are filled with 11k if there is silence in the buffer, which there is in my test context:
Screenshot 2020-10-17 at 3.47.09 pm

That’s not always the case, as it seems to depend on what else is going on, and what else may be in that bit of buffer etc…, but contrary to my expectations, this approach of taking a larger analysis window is biasing my queries upwards towards Nyquist/2.

For example, here is a fairly bright sample hit, with a mean around 9k (which sounds about right):
Screenshot 2020-10-17 at 4.00.14 pm

I have no idea what’s going on in the frames before the attack, though the ones at the end seem in line with a decrease in centroid over time.

My question is, what happens with the zero-padded equivalent of this? I have no way of seeing the “zeros” that are put into that part of the analysis since it happens behind the scenes, but if I do this:
(screenshot)

I don’t seem to get results that are in line with those light green frames being filled with 11k. So is the “zero-padding”, with regards to spectral descriptors, doing something other than filling them with Nyquist/2? Are those light green frames filled with a centroid value of “zero”? That would be more in line with what (I think) I’m seeing, with the mean getting pulled down by zero padding.

What is in those first few hops when zero-padding an analysis with spectral frames? Nyquist/2 or zeroes?

Also, with this it occurred to me that I can also have a double-buffered “reverse” audio stream going at all times (driven by the same count~), so when I detect an onset, I can fluid.bufcompose~ the frames I want (dark green), and then do two copies from the reversed buffer into the mirrored frames (light green). A bit of a faff that, but I guess it’s possible with native objects.
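For reference, a rough numpy sketch of that mirroring idea outside of Max (the pad length of winSize - hop per side is my guess at how much of the edge frames overhangs, though it conveniently lands on the same 640-samples-for-256 numbers as above):

```python
import numpy as np

def mirror_pad(audio, win_size, hop):
    """Reflect the head and tail of `audio` so the outer analysis frames see
    plausible material instead of zeros. The pad length of win_size - hop per
    side is my assumption about how much of the edge frames overhangs."""
    pad = win_size - hop
    head = audio[1:pad + 1][::-1]       # mirrored start (excluding sample 0 itself)
    tail = audio[-pad - 1:-1][::-1]     # mirrored end (excluding the last sample)
    return np.concatenate([head, audio, tail])

onset = np.random.randn(256)            # the 256 samples actually of interest
padded = mirror_pad(onset, win_size=256, hop=64)
print(len(onset), len(padded))          # 256 -> 640, the same numbers as above
```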

Zeros, because any padding happens in the time domain, before the signal gets anywhere near the algorithm (which doesn’t know about padding). You can probably see what the results are internally using framelib (padding vs mirroring)
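To illustrate that “padding happens in the time domain” point, here’s a small numpy approximation of the frame marching (not the actual library code, and the winSize - hop pad per side is an assumption): with that much padding even the outermost frame still contains a hop’s worth of real audio, so nothing falls back to the silence value, unlike whole frames of recorded silence sitting in a buffer.

```python
import numpy as np

sr, win, hop = 44100, 256, 64
audio = np.random.randn(256)                       # the region you asked to analyse
pad = np.zeros(win - hop)                          # assumed pad amount per side
padded = np.concatenate([pad, audio, pad])

def centroid(frame):
    mags = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return sr / 4 if mags.sum() < 1e-12 else (freqs * mags).sum() / mags.sum()

# The analysis just marches over the padded signal; it never "knows" which
# samples were padding. With this much padding every frame still contains at
# least a hop's worth of real audio, so none of them hit the silence fallback,
# unlike whole frames of recorded silence sitting in the buffer.
for i, start in enumerate(range(0, len(padded) - win + 1, hop)):
    print(i, round(centroid(padded[start:start + win]), 1))
```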


I thought I understood this, but thinking back on it today, does that mean those edge frames are super pulled up towards Nyquist/2, in the same way that I’m experiencing when trying to do a larger analysis window with potentially some extra silence at the edges?

Or, the moment there’s a tiny amount of audio in an analysis frame (frame #1 in my spreadsheet/diagram), does that break the “default value”-ness of Nyquist/2?

I’m just confused as to what the difference is between me “zero-padding” by having silence or near-silence, vs the algorithm zero-padding, if it’s all happening in the time domain.

It’s possible because, as Alex says above, the zeros can introduce a discontinuity that can, in turn, spread a bunch of energy over the DFT spectrum.
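Here’s a tiny toy example of that smearing in numpy (my own illustration, nothing from the toolbox): a sine that fills the whole frame keeps its centroid at its own frequency, but the same sine cut to digital zero halfway through the frame spills energy across the spectrum and the centroid jumps up.

```python
import numpy as np

sr, n = 44100, 1024
f0 = 24 * sr / n                         # bin-centred sine (~1034 Hz), so no leakage on its own
t = np.arange(n) / sr
full = np.sin(2 * np.pi * f0 * t)
cut = full.copy()
cut[n // 2:] = 0.0                       # hard step to digital zero mid-frame

def centroid(frame):
    mags = np.abs(np.fft.rfft(frame))    # deliberately no analysis window, to expose the step
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return (freqs * mags).sum() / mags.sum()

print(round(centroid(full)))   # ~1034: all the energy sits at f0
print(round(centroid(cut)))    # much higher: the discontinuity smears energy up the spectrum
```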

None, I would think. But I thought you were doing your own mirroring / folding instead? I’d suggest that FrameLib is the tool for this kind of forensically orientated work (i.e. you want strict timing, and you really care about the initial results from a small population of frames).

(aah, it’s nice to be able to properly quote formatted text again)

Gotcha, and that makes sense.

Out of tangential curiosity, is the Nyquist/2 thing the norm when working with spectral descriptors and silence? It makes sense in that it’s probably mathematically (?) correct, since all the frequencies are at equal loudness (none).

I should poke at this and see. It’s just hard because I can’t really see/access the actual zero-padding that fluid.spectralshape~ is doing, vs what I can feed it with a larger analysis window. So it could very well be the same results, just hard to see. Running the patch above on the general demosound sounds, the larger window gives me visible/readable Nyquist/2 at the edges, since I can see them before throwing them away.

I’m still quite green at framelib, so it would take me longer to figure out how to start building something like that, than actually building something like that.

At the moment I’ve not yet done that. The tests I did a while back (above) were conceptually wrong, as I was mirroring the results I got from the analysis, rather than the audio I fed into the analysis. It still did something since there was a stats step that comes next anyways, but it would be (I imagine) very different for actual mirroring of audio.

I think what would take the least amount of faffing would be to hand-craft a couple of audio examples and just analyze those static things, rather than working out the parallel/reverse JIT audio analysis, which has to concatenate the perfect bits of audio for this all to work.

I’d put a strong vote in for switchable edge behaviours as part of the objects, because although I agree that the current interface allows you to get to lots of options, the process of how you get to them in a musical context is important. I am wary of comparisons between things I’ve made and parts of FluCoMa (because this is not a competition), so what I’m interested in here is the logic. There is a very clear reason (from my perspective as a creator) that I decided to build a bunch of edge conditions into FrameLib objects (even though one could prepare the frame using other objects and post-process): it is long and tedious work, which I frequently get wrong, and it involves lots of diagram-type work (as Rod has done here). I absolutely don’t want to deal with that in a musical context when I’m using it, so I hive it off into a technical area of work. There are also not so many options that make sense, so to me it doesn’t make sense to have users replicate that work with the possibility of mistakes, as compared to building it in.


I obviously agree here, but I think this is something that can easily get overlooked by someone casually, or even somewhat deeply, using the objects.

I mean, I only really know about the implications of what frames are getting analyzed before/after the chosen frame because of in-person conversations, which most users won’t have had the luxury to have.

Same goes for the zero padding and other “overall edge cases” that happen as part of the system at large.

All of that to say, offering options with friction leads towards no options being offered at all (and default choices being made more often).

As an aside, it would be handy to have something like the bit of code I posted above in a helpfile somewhere to show how to create other edge cases by just choosing larger analysis windows and dropping outer frames.

Revisiting this analysis frame thing today. I’m trying to build a kind of slow-mo envelope generator by using the loudness (and eventually other spectral descriptors) on a frame-by-frame basis so that a single attack can yield a variety of envelope contours, and when unpacking the loudness frames I can never get an envelope shape that seems correct.

So at the moment I’m doing this:

And throwing out the first three and last three frames, so taking frames #4 through #10.

My thinking is that there will be real audio in frame #4, as opposed to being zero-padded.

That ends up giving me attacks that look like this:
Screenshot 2020-10-25 at 7.59.11 pm

And this:
Screenshot 2020-10-25 at 7.59.05 pm

(that’s normalization in dB and then converted to linear amplitude in p normalize)

So with this, the loudest frame I get is frame #7, or frame #4 if I count from the trimmed frames, which corresponds with the analysis frame that is fully centered in the analysis window itself.

This, however, is not the loudest frame in the overall window as these are percussive attacks with super fast onsets.

It makes computational sense why this is the case, as the loudest part of the analysis window is generally preceded by the silence (digital or otherwise) prior to the detected onset. But this isn’t really intuitive, as perceptually I would expect something more along the lines of this:
Screenshot 2020-10-25 at 8.15.12 pm

I’m thinking that mirroring here might give a more perceptual/intuitive set of values since it would, presumably, have the highest amount of energy centered around that hop of the analysis window. Or rather, folding, to use @a.harker’s lingo from above.

I think that would require dropping another analysis frame too, so only taking from frame #5, which would be the first frame that would be centered at the start of the analysis window.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Am I thinking about this correctly?
What’s normally done for short/loud analyses?

Up to this point, particularly with summary stats, I’ve been taking the mean of the whole thing here, which, if consistent, should correspond to a comparable descriptor across the board, but if I’m interested in the specific envelope and contour of that short window of time, I feel like there’s probably a better way to approach this than I am presently doing.

You are unlikely to see a shape like the one you are expecting. The system of analysis is symmetric in relation to an ideal impulse, as the important thing is that the analysis is in chunks and thus the impulse either falls under a given window (chunk) or not. It is likely to fall late in the first window under which it falls, and thus be significantly reduced by the windowing. The highest output is where it falls centrally in the window. Overlapping will adjust the “pre-ring” (more overlap gives more frames of “pre-ring” and vice versa).

Also, although the window reduces the level at the edges, at the 25/75 points the level reduction is likely smaller than the effect of the sudden loud transient - hence the relatively flat output around the peak in dB.
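A quick numpy sketch of that, using a toy RMS-per-frame rather than the toolbox’s actual loudness algorithm, just to visualise the symmetric pre/post-ring around an impulse:

```python
import numpy as np

sr, win, hop = 44100, 256, 64
sig = np.zeros(1024)
sig[400] = 1.0                                    # an "ideal" impulse

window = np.hanning(win)
starts = range(0, len(sig) - win + 1, hop)
levels = [np.sqrt(np.mean((sig[s:s + win] * window) ** 2)) for s in starts]

for i, lvl in enumerate(levels):
    print(i, round(20 * np.log10(lvl + 1e-12), 1), "dB")
# Only the frames whose window overlaps the impulse register anything; the
# largest value comes from the frame that catches it nearest the window's
# centre, and the frames that catch it in the window's tails (the "pre-ring"
# and "post-ring") are progressively reduced. More overlap simply means more
# of those partially overlapping frames.
```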

The shape you draw is what you would get from a time-domain envelope follower that has been biased to respond faster to attacks (and therefore is non-symmetric). I suppose using a super short FFT might get you closer to that (but if so, then why not use the time domain)?
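For comparison, a minimal sketch of that kind of asymmetric time-domain follower in numpy (the attack/release times are arbitrary, and this isn’t any particular FluCoMa or FrameLib object):

```python
import numpy as np

def envelope_follower(x, sr, attack_ms=1.0, release_ms=80.0):
    """One-pole follower with a fast attack and slow release, i.e. the
    non-symmetric, time-domain behaviour described above. The attack/release
    times here are arbitrary placeholders."""
    a_coef = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    r_coef = np.exp(-1.0 / (sr * release_ms / 1000.0))
    env = np.zeros(len(x))
    prev = 0.0
    for i, v in enumerate(np.abs(x)):
        coef = a_coef if v > prev else r_coef       # faster coefficient on the way up
        prev = coef * prev + (1.0 - coef) * v
        env[i] = prev
    return env

sr = 44100
hit = np.zeros(4410)
hit[100:150] = 1.0                                  # a short, sharp burst
print(envelope_follower(hit, sr)[90:400:50].round(3))  # rises fast, decays slowly
```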


What’s surprising is that the highest output (here) is in the middle of the overall analysis window, whereas the sharpest spike is happening at the start.

Actually, looking at the tiny fragment of audio (256 samples), it doesn’t show much of a morphology:

So this is, I guess, the tiny start of what I drew out. And the contour I’m seeing is probably “correct” in terms of perceptual expectation (like a 2ms “onset”).

I can go for a time-domain thing for loudness, but the benefit of this would be having it time-aligned with other descriptors for the same window of time. So at x point in time, the loudness/centroid/etc… were [list of values]. I could S&H a time-based envelope follower, but that’d still leave me in a similar boat for spectral analysis. Though folding may be useful there.

I haven’t yet tried a smaller hop size (I’m presently using @fftsettings 256 64 512), but I feel like I’m nearing the squeezing-water-from-a-rock point with such a tiny analysis window.

I can, perhaps, push this kind of “envelope extraction” onto the predicted analysis stuff, where I’ll (hopefully) have 100ms of audio to analyze. What a luxury!

If you are analysing 256 samples only, with a window of 256 samples and a hop of 64, then you would be almost totally guaranteed to have the largest value for the central analysis frame (the one that entirely lines up with the audio), whatever the shape of the audio (for a rectangular window it would always be true, for other shapes there are probably some edge cases).


I’ve got lost somewhere in this: you’re talking about fluid.loudness, yes? But this isn’t a spectral-frame based process: it’s a windowed time domain algorithm.

IAC, you might be better off taking the peak-per-window if it’s those sharp little moments you’re after.


I did pivot to loudness there, though I guess the same thinking applies to fluid.spectralshape~.

By that last bit, do you mean manually (programmatically) reading through each window/hop to pull a peak value out?

I was wondering about this, but my brain got lost with all the other flip/mirror/hops going on in this thread. Fundamentally, the frame-based processes across the fluid.object~s take a mean of each window of analysis, and then you typically compute an aggregate stat of those core means. Is it a thing to have a different base statistic going on, for descriptors? Like if I want the peak-per-window, as you’re saying here. Or maybe max-per-window, though I guess that makes less sense in a spectral context.

The effects of zero padding will be quite different, because there’s no FFT (so discontinuities don’t smear in the same way)

The second output of fluid.loudness is the peak in the window

I don’t know how useful it is to consider any frame-wise process as simply delivering a mean across the span of the window, because it really depends on the algorithm (e.g. in HPSS you’re definitely not getting a mean, as such).
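To see why a per-window peak behaves differently from a windowed mean/RMS on these sharp little moments, here’s a small numpy comparison (a toy version of both measures, not fluid.loudness~’s actual algorithms):

```python
import numpy as np

win, hop = 256, 64
sig = np.zeros(640)
sig[200] = 1.0                                      # one sharp little moment

window = np.hanning(win)
for i, s in enumerate(range(0, len(sig) - win + 1, hop)):
    frame = sig[s:s + win]
    rms = np.sqrt(np.mean((frame * window) ** 2))   # a "summarised" per-window level
    peak = np.abs(frame).max()                      # per-window peak, no windowing applied
    print(i, round(rms, 4), round(peak, 4))
# The peak column is 1.0 for every frame that contains the transient at all,
# regardless of where it sits in the window (or how much padding surrounds it),
# while the RMS column still rises and falls with the window position.
```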

Ah right! Handy.

I always read that as being another loudness-measure variant (“true peak™”). Will check stuff and see how that looks.

edit:
unrelated, but I’m getting this spammed in my window, which I’ve never seen before. This is with the fluid.bufloudness~ helpfile:

fluid.bufloudness~: maxWindowSize value (17640) adjusted to power of two (32768)

Alright, this is pretty useful.

Since it’s also a peak, the zero-padding doesn’t matter, so taking the larger analysis window makes no difference (as far as I can tell). The envelope is also a bit more in line with expectations.

Here are a couple of random attacks showing all three for comparison:
Screenshot 2020-10-28 at 1.30.30 pm

Screenshot 2020-10-28 at 1.30.16 pm

Screenshot 2020-10-28 at 1.28.46 pm

Screenshot 2020-10-28 at 1.30.08 pm

As an aside, I know one can do whatever they want, but is there a perceptual/theoretical reason why picking peaks like this isn’t useful? As in, if I do the same for offline/realtime (target/source), does it matter?

Or should I still use the mean’d version in that context, but instead pull out peaks like this if I’m trying to generate contours/envelopes?

Perhaps not surprisingly, this “envelope extraction” works better on longer stretches of audio.

Here it is with 100ms windows (my longer “predicted” bit of analysis). At the moment I’m using the same analysis/FFT settings (256 64 512), but with a larger overall analysis window (4410).

In this context, I think the mean-per-frame is better than the peak-per-frame. Also, with so many frames, I think the overall gesture/contour is the same, with the peak version just being a bit steppier.

What’s also somewhat surprising is how the centroid behaves over time. I expected a bit more of a rolloff or some kind of change over that analysis window, but it appears to be largely flat. I checked other spectral descriptors as well (not massively thoroughly), but most show similar (lack of) contours. I guess that’s the nature of this being a single “attack” or “moment”, as opposed to some kind of morphology that changes centroid/timbre over time in a more overt way.

There’s also more meat on these bones as well, which could serve well for the kind of contour/envelope analysis discussed in this thread.

I was thinking about this again today for a couple of reasons. One is that having a pipeline-like workflow might be made a bit more complicated if I’m fussy about which frames I’m taking (since I’d have to scale that up/down with @numframes); the other is that it’s not always possible with certain types of analysis.

What I’ve been doing for a bit is what I describe above: if I’m interested in 7 frames of loudness analysis, actually analyzing 13 and pruning down to 7, so that the overlapped area contains “real” information rather than being zero-padded. In the tests I did above, that seemed to produce quite decent (visual) results.

It then hit me this week, while coding up the LTEp thing, that I can’t exactly do that with offline file analysis. Or rather, if the files are sliced tightly there is no “real audio” before the start of the file anyways. So for the sake of consistency between descriptors, I’ve since dropped my semi-“workaround” for the lack of mirrored frames. Bummer, but I think having more consistent matching/parity is more important than one side of the equation being “more correct”.

Somewhat related, although not the exact subject of this thread, is choosing which frames are analyzed. I’ve always done all the frames for loudness, but then only the “middle” frames for spectral/pitch stuff, based on some discussions from the first plenary. Now that we have weighting, I think that somewhat mitigates the possibility of skewing stats around by funny stuff in the edge frames (though not entirely since loudness also gets zero-pulled). That got me thinking about the scalability and reusability of different analysis pipelines. The demo code in the LTEp patch is hardcoded to 256 samples (7 frames with “middle” frames for spectral stuff). If I request a larger amount of frames, my @startframe 3 @numframes 3 thing doesn’t work anymore. I could programmatically have it such that I always ignore the first and last x amount of frames based on the FFT/overlap, but that’s getting a bit fiddly, and has ramifications for buffer sizes down the pipeline, particularly since things like @weights have to be exactly the right size.
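For the “scale it up/down programmatically” part, the arithmetic I’ve been assuming looks roughly like the sketch below (the winSize/hop - 1 figure for edge frames is my assumption, though it lines up with the 13-analysed/7-kept, drop-three-per-side numbers earlier in the thread):

```python
def frame_trim(num_wanted_frames, win_size, hop):
    """How many frames to analyse, and where to start reading, so the kept
    frames only overlap 'real' audio. Assumes one frame per hop and that
    win_size/hop - 1 frames per side touch the padding, which matches the
    256/64 "analyse 13, keep 7, drop 3 per side" case above, but check it
    against your own patch before trusting it."""
    edge = win_size // hop - 1                # frames per side that touch the padding
    total = num_wanted_frames + 2 * edge      # frames to actually analyse
    return total, edge, num_wanted_frames     # (analyse, @startframe-ish offset, @numframes)

print(frame_trim(7, 256, 64))     # (13, 3, 7)
print(frame_trim(17, 512, 128))   # scales with other window/hop settings
```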

So this is half comment and half question in terms of the current thinking or temperature on being selective with the frames you choose to apply stats to within a window, vs letting weighting handle that for you.

Don’t get me wrong, I still think having a native @edgeframe option would be handy, as I think for short analysis windows in particular mirrored windowing would be ideal, but for now I’m wondering how useful the results are vs the additional faff.