Fast and *consistent* onset detection

Ok, so a bit of an odd post here.

As you know I’ve been chasing the onset detection dragon (1, 2, 3) for a while and have gotten settings that work really well and across a wide range of material/dynamics.

Lately I’ve been working on something (with input/help/suggestions from a few friends: Tom Ward, @a.harker, @weefuzzy, and more recently @timlod) that uses four of these DIY sensors on a single drum, with the idea of using the time difference of arrival (TDOA, similar to this NIME paper) to triangulate the absolute position of a strike on the drum.
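For anyone following along, here is a minimal sketch of the general TDOA-to-position idea as a brute-force grid search. It is not the code used in this project. The sensor layout, the wave speed, and the function name `locate_strike` are all assumptions for illustration, and it treats the wave speed in the membrane as a single constant, which is a simplification.

```python
import math

def locate_strike(sensors, tdoas, speed, radius, step=0.002):
    """Grid-search the strike position that best explains the measured TDOAs.

    sensors: (x, y) sensor positions in metres (hypothetical layout).
    tdoas:   arrival time at each sensor minus arrival at sensors[0], in seconds.
    speed:   ASSUMED constant wave speed in the membrane, in m/s.
    radius:  drum radius in metres; candidates outside it are skipped.
    """
    best, best_err = None, math.inf
    n = int(radius / step)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            x, y = i * step, j * step
            if x * x + y * y > radius * radius:
                continue
            dist = [math.hypot(x - sx, y - sy) for sx, sy in sensors]
            # Squared mismatch between predicted and measured TDOAs.
            err = sum(((d - dist[0]) / speed - t) ** 2 for d, t in zip(dist, tdoas))
            if err < best_err:
                best, best_err = (x, y), err
    return best

# Four sensors near the rim of a 14" drum (illustrative positions).
sensors = [(0.0, 0.17), (0.17, 0.0), (0.0, -0.17), (-0.17, 0.0)]
speed = 100.0  # placeholder value, not a measured membrane speed
true_pos = (0.04, 0.06)

# Simulate ideal TDOAs for the true position, then recover it.
dist = [math.hypot(true_pos[0] - sx, true_pos[1] - sy) for sx, sy in sensors]
tdoas = [(d - dist[0]) / speed for d in dist]
estimate = locate_strike(sensors, tdoas, speed, radius=0.1778)
print(estimate)
```

The point of the sketch is that any systematic bias in the measured TDOAs feeds straight into the error term, so the estimate shifts even when the geometry is right.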

This is presently working fairly well (though it goes out to Python for some of the heavy lifting) and is being solved with cross-correlation (using fl.correlate~ initially, but something a bit different now).
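As a rough illustration of what cross-correlation is doing here, the sketch below estimates the delay between two channels by finding the lag with the highest correlation. It is plain time-domain cross-correlation in dependency-free Python, not the fl.correlate~ setup or the “something a bit different” mentioned above. The names `xcorr_delay` and `burst` and the synthetic signals are made up for the example.

```python
import math

def xcorr_delay(a, b, max_lag):
    """Return the lag (in samples) at which b best lines up with a.

    A positive result means b arrives later than a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def burst(total, start, length=200, period=20.0, decay=0.03):
    """Synthetic damped sinusoid standing in for a drum hit (illustrative only)."""
    out = [0.0] * total
    for k in range(length):
        out[start + k] = math.exp(-decay * k) * math.sin(2 * math.pi * k / period)
    return out

near = burst(512, start=100)  # channel that hears the hit first
far = burst(512, start=112)   # same hit, 12 samples later
delay = xcorr_delay(near, far, max_lag=64)
print(delay)  # 12 for this synthetic pair
```

Note that the whole burst has to be available before the lag can be found, which is where the latency cost of this approach comes from.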

It also works when using TDOA from the onset detection directly, but I noticed that the “scale” of it was different. As in, the constants from the formula(s) seemingly weren’t working correctly, even though the direction/orientation was still good.

Here’s a vid example where I’m hitting the drum moving towards the point marked “N” in 20mm increments, at first using cross-correlation, then onset-based TDOA:


This is where a related project that I’ve been working on on the side with @timlod (Hi!) comes into play: basically making a large, fairly comprehensive corpus for testing/analyzing drum hits for ML/descriptors.

Tim (@timlod) did some close looking and comparing and saw that the way sound travels through a membrane is kind of inconsistent. Each hit (peak) is preceded by an amount of “pre-ringing” that is directly proportional to the distance to a given sensor. As in, if a strike is close to the sensor, the pre-ring is there but “smooshed”, and if you are further, it is spread out.

The top plot is a hit in the middle of the drum (equidistant from all sensors) where you can see the “pre-ringing” clearly. The lower plot is the hit closest to the north sensor (blue). You can see the same pre-ringing but it happens over ~15 samples vs 50-100 samples for the other sensors (red/orange being west/east and green being south, with the longest pre-ring).

The air microphones (DPA 4099, C214s) show something similar (in terms of pre-ring) but a bit less:

(green is the DPA; OHL/OHR are the C214s, about 1m above the drum)

So the reason I’m mentioning this is that amplitude differential onset detection works really well at finding out that an onset has happened ASAP, but it struggles to place that point consistently. It can tell that an onset has happened (generally somewhere within the pre-ring) but is pretty inconsistent as to where. The variation (on a 14” drum) can be up to 70 samples.
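To make the problem concrete, here is a small sketch of a generic fast/slow envelope comparison firing inside the pre-ring. It is not necessarily how any particular object implements amplitude differential detection. The function names, the coefficients, and the synthetic hit (a quiet rising pre-ring followed by a loud decaying peak) are assumptions chosen only to show the effect.

```python
import math

def envelope_diff_onset(signal, fast=0.5, slow=0.01, threshold_db=6.0, floor=1e-4):
    """Return the first index where the fast envelope exceeds the slow one
    by threshold_db, or None if that never happens."""
    env_fast = env_slow = floor
    for n, x in enumerate(signal):
        level = max(abs(x), floor)
        # One-pole smoothing with two different speeds.
        env_fast += fast * (level - env_fast)
        env_slow += slow * (level - env_slow)
        if 20.0 * math.log10(env_fast / env_slow) > threshold_db:
            return n
    return None

def make_hit(pre_ring, start=200, total=600):
    """Crude synthetic strike: pre_ring samples of quiet ringing, then the peak."""
    sig = [0.0] * total
    for k in range(pre_ring):
        sig[start + k] = 0.1 * (k + 1) / pre_ring * math.sin(2 * math.pi * k / 10)
    peak = start + pre_ring
    for k in range(total - peak):
        sig[peak + k] = math.exp(-0.02 * k) * math.cos(2 * math.pi * k / 10)
    return sig, peak

leads = {}
for pre_ring in (15, 80):  # short vs long pre-ring, as for near vs far sensors
    sig, peak = make_hit(pre_ring)
    onset = envelope_diff_onset(sig)
    leads[pre_ring] = peak - onset  # how far ahead of the peak the detector fired
print(leads)
```

In this toy case the detector fires near the start of the pre-ring both times, so its distance from the main peak changes with the pre-ring length, which is the kind of inconsistency described above.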

For some more recent testing we’ve tried a different mic array position:

So with this new mic array (sensors basically at ~10/12/2 o’clock) you can see how varied the onset detection is when striking the drum at the center:

When playing really close to the sensors, the difference between onsets is smaller/closer, as is the pre-ring, so it is “more accurate” here relative to ground truth (à la cross-correlation):

And finally a hit furthest away from the sensors showing the longest pre-ringing:

This is problematic for a few reasons, the most pressing in this use case being that it means the TDOA gets thrown off when using onsets. As in, the results you get from onset-based TDOA vs cross-correlation are quite different, and although they move in a similar direction (as in the vid above), the onset-based version is really inconsistent in terms of how “accurate” it is.

Secondly, it has big ramifications for small analysis windows full stop, as the same strike/timbre can be off by up to 70 samples in terms of what the subsequent descriptor/MFCC analysis is seeing. I’m doing 7 frames of analysis (256 64, starting window - hop before the onset), so that is a bit more than a whole frame of audio variance depending on where the strike happens on the drum.


That is why I’ve come to the big brains on here (@tremblap, @a.harker, @weefuzzy in particular), to see if anyone has any thoughts on how this could be improved/mitigated.

Tim and I have both messed with thresholding, and Tim has even worked out a separate relative macro threshold which can improve things a bit further, but ultimately it seems very difficult to detect when the onset has happened and have that be consistent across different signals.


Some additional thoughts:
-cross-correlation bypasses the problem, as it will align the audio regardless and so acts as a “ground truth”, but the tradeoff is additional latency: waiting for enough frames to properly cross-correlate the audio gets impractical (and it is quite fragile, as cc seems to break catastrophically near the edges). perhaps there is some room for improvement here (half-windows?)
-there are a couple of intended use cases here, some of them using the DIY sensors as above, but another involves a mix of direct sensors and air microphones, which, depending on the distance, can add much more time while waiting for an appropriate cc window (you can see in the example with the C214 OHs above how much later the audio from those arrives)
-I’ve done some (though not exhaustive) testing with frequency-based onset detection (i.e. fluid.onsetslice~ / fluid.onsetfeature~) and have not gotten better (or even “really good”) results
-perhaps a more sophisticated version of the backtracking that fluid.ampgate~ (@lookback, @lookahead) does may improve things here (at the expense of more/some latency)
-can this “consistent” pre-ringing somehow be leveraged as an assumption in terms of improving onset detection consistency?
-there’s an old (canonical?) Roland patent where they propose determining distance from the sensor by counting the samples between the first zero crossing after an onset is detected. this seems to be consistent with what is being experienced here (basically the “pre-ringing” shrinks as you are closer to a sensor, consistently, and pretty measurably)
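On the backtracking idea, one possible direction is to re-anchor a raw detection to the main peak instead of to wherever the detector happened to fire inside the pre-ring. The sketch below is only an illustration of that idea, not what fluid.ampgate~ does internally. `refine_onset`, its `lookahead` and `fraction` parameters, and the synthetic hit are all hypothetical, and it assumes the pre-ring stays well below the chosen fraction of the peak.

```python
import math

def refine_onset(signal, detected, lookahead=128, fraction=0.25):
    """Move a raw onset to the first sample that reaches `fraction` of the
    largest absolute value found within `lookahead` samples of it."""
    end = min(len(signal), detected + lookahead)
    window = [abs(x) for x in signal[detected:end]]
    peak = max(window)
    for offset, level in enumerate(window):
        if level >= fraction * peak:
            return detected + offset
    return detected

def make_hit(pre_ring, start=100, total=400):
    """Crude synthetic strike: pre_ring samples of quiet ringing, then the peak."""
    sig = [0.0] * total
    for k in range(pre_ring):
        sig[start + k] = 0.1 * (k + 1) / pre_ring * math.sin(2 * math.pi * k / 10)
    peak = start + pre_ring
    for k in range(total - peak):
        sig[peak + k] = math.exp(-0.02 * k) * math.cos(2 * math.pi * k / 10)
    return sig, peak

offsets = []
for pre_ring in (15, 80):
    sig, peak = make_hit(pre_ring)
    raw = 101  # pretend the detector fired one sample into the pre-ring
    offsets.append(refine_onset(sig, raw) - peak)
print(offsets)  # [0, 0] for these synthetic hits
```

The obvious cost is that the refined onset is only known `lookahead` samples after the raw detection, so this trades latency for consistency, and real pre-ringing may not separate from the peak this cleanly.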



Are there ways to improve amplitude-based onset detection so that the detected onset point is more consistent in signals with variable “pre-ringing”?
