Just-In-Time (sort-of) (imperfect) (fun) Polyphonic Classifier

Ok, tested it finally.

I can see your confusion @tutschku. The numbers don’t correspond to the order at all. You have to read through the text on the right of the patch to get it going. Once you add the fluid.bufcompose~, the core patch should work.

I haven’t tested it with acoustic kick/snare/hat yet, but I tested it with much more subtle sounds. (I went to record some new sounds for this last week, but as you know, I can’t really use any USB2 audio interfaces with my new laptop…)

First, here's what the filter dicts look like for the synthetic kick/snare/hat:

This is what the dicts look like with my prepared snare/crotale training set:

The differences are significantly more subtle. I can’t get these to match consistently at all, though I haven’t spent much time optimizing the thresholds or using the jitter part(s) of the patch yet.

Worse yet are the differences between different types of snare hits (center, edge, rimshot, etc…):

Stuff like this is going to be really hard to differentiate, given how similar the filters look.

@rodrigo.constanzo did you try it? It does all look the same, and all in the low end, so maybe a larger FFT size will help.
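As a rough back-of-envelope on why everything blurs into the low end at small FFT sizes (assuming a 44.1 kHz sample rate; the figures scale with the actual rate):

```python
# rough bin-width arithmetic, assuming a 44.1 kHz sample rate
sr = 44100
for fft_size in (128, 256, 512):
    print(fft_size, "->", round(sr / fft_size, 1), "Hz per bin")

# 128 -> 344.5 Hz per bin  (kick/snare fundamentals all land in the first bin or two)
# 256 -> 172.3 Hz per bin
# 512 -> 86.1 Hz per bin
```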

@tutschku you indeed need the real install for it to work. Very few patches I do rely only on one of our objects.

@rodrigo.constanzo another idea is to try a longer resonance with a higher FFT too. Try to make it work… although a classifier is not really what this is for (there are better algorithms for that; you could, for instance, try the full MOOC on machine learning if you are interested in this before our second toolbox, although that would defeat the point of making you do something creative with the current toolset). The fast descriptor approach you have done with my JIT code is another option. We have useful descriptors coming with cool stats tools as part of the 1st toolbox too, and I’ll make sure I make such a patch for you :wink:

I’ll play with the FFT size some to see if it looks any better, as well as some longer window times.

I suspect it will never perform as well as the basic kick/snare/hat comparison since those are quite different.

I’ll also try it with the direct audio from the sensor to see if that makes any difference.

A combination of the onset descriptors and this might be interesting. Not sure how that would work, but worth exploring.

Also curious what the descriptor stuff you guys are bringing forward will be (and how it will differ from @a.harker’s stuff).

What I discovered through @groma’s experience is that how you deal with them statistically changes everything. We’ll have a bunch of examples to explain that, I reckon.


Ok, tested things further.

First I made new filter dicts using a bigger FFT size (512/64). That gave me these (obviously more detailed) filters:

Prepared snare/crotales:

Snare hits:

The bigger FFT size still didn’t work too well with the normal thresholding used in the top level patch.

BUT it worked pretty consistently with the jitter processing (jit.3m -> zl sort -> zl slice 2), on both sample sets. It also worked well with the normalised relative value maximum-ing.
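In numpy terms, that post-processing amounts to something like the sketch below. This is my own approximation of the jit.3m -> zl sort -> zl slice 2 chain and of the “normalised relative value” idea, not a literal translation of the patch, and all the names are made up:

```python
import numpy as np

def pick_by_mean(H, labels):
    """Rough equivalent of the jit.3m -> zl sort -> zl slice 2 chain:
    take the mean activation per class, sort, keep the top two."""
    means = H.mean(axis=1)                    # jit.3m-style per-class mean
    order = np.argsort(means)[::-1]           # zl sort (descending)
    return [(labels[i], float(means[i])) for i in order[:2]]  # zl slice 2

def pick_by_relative_max(H, labels, rel_thresh=0.8):
    """The 'normalised relative value' idea: scale the per-class peaks so the
    strongest is 1.0 and keep anything above a relative threshold."""
    peaks = H.max(axis=1)
    rel = peaks / (peaks.max() + 1e-12)
    return [labels[i] for i in np.where(rel >= rel_thresh)[0]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = np.abs(rng.normal(size=(3, 20)))      # fake activations for 3 classes
    H[1] *= 3.0                               # pretend class 1 ('snare') dominates
    labels = ["kick", "snare", "hat"]
    print(pick_by_mean(H, labels))
    print(pick_by_relative_max(H, labels))
```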

So with that, I went back to the FFT size of 128/64 and was able to get usable results with the jitter means on both sample sets (even the similar snare samples!). With the smaller FFT size, the normalised relative value didn’t work too well though.

I guess it’s a matter of fine-tuning the post-processing to work quickly and consistently, even across different sample sets.

In revisiting some of the posts in the thread about pre-processing for NMF, I came across the filter comparisons from before the @filterupdate 1 bug fix.

These are the raw filter dicts that were created from training multiple hits:

And then these are the ones from when I ran the process again on a pre-seed dict with @filterupdate 1:

The difference is pretty huge.

I’m wondering if something like this can be incorporated into how you are creating the initial dicts. Since you’re updating the dict with each hit as you train it, you never run it on a pre-seed bit of performance audio. I get really confused with this side of things, as the averaging/normalising/pre-seeding gets very close to voodoo for me.

Do you think that would be useful?

With the code I provided, you can do it yourself. At this point I’ve chewed a lot for you, but you need to experiment a bit more with how to update dicts. For instance, and without having tried it, you could have a refining process in which you train on one of each class in one pass… like the seminal piano example I provided (but with a smaller number of classes)… but again, you are trying hard to make it do something it is not meant to do, instead of trying to do stuff with what it can do… NMF as a classifier will be slow, creative, and divergent, as it tries to combine the 3 dicts to make the sound you are trying to match. Think of it as a division.
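To make the “division” point concrete, here is a minimal numpy sketch of NMF with the bases held fixed: the only thing being searched for is how much of each trained dict to mix at every frame. It uses plain Euclidean multiplicative updates, which may not match the divergence the object actually uses, so treat it as an illustration rather than the object’s internals:

```python
import numpy as np

def activations_for_fixed_bases(V, W, n_iter=100, eps=1e-12):
    """V: magnitude spectrogram (bins x frames), W: trained bases (bins x classes).
    Returns H (classes x frames) such that W @ H approximates V."""
    H = np.full((W.shape[1], V.shape[1]), 0.1)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)   # the 'division' step
    return H

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = np.abs(rng.normal(size=(65, 3)))         # 3 fake dicts, 65 bins (128-point FFT)
    true_H = np.abs(rng.normal(size=(3, 40)))
    V = W @ true_H                               # a 'performance' made of the 3 classes
    H = activations_for_fixed_bases(V, W)
    print(np.round(np.corrcoef(true_H.ravel(), H.ravel())[0, 1], 3))
```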

I’ll try adding a refining pass and see how that goes.

The results do seem more promising than what I was doing before, but I may abandon this line of inquiry altogether (for what I want to do with it) and focus more on the ‘onset descriptors’ approach, since it doesn’t require having classes and such predefined.

Either way I’ll test it more and make a patch that does all that.

I thought that was the whole point :wink:

indeed, so please continue and fight! It is fun to see where you will go. Abandoning nmf because it is not perfect is a bit sad though, since there are many other things you could try to map to that approximate polyphonic classification (which you would not get otherwise I think, but here I might not be right… I’m still learning myself!)

I’ll definitely keep testing and will post my results.

Not in general, but in terms of what I was thinking specifically for the performance.

Ok, did some testing with acoustic kit sounds.

I recorded everything with a DPA 4060 magnetically clipped to the snare, kind of above the kick.

I hit each drum at a variety of dynamics and created sets with snares on and off.

Here is a “hybrid” of kick being trained with snares off, and snare being trained with snares on:

I was surprised by how subtle the differentiation between kick and snare is; I imagine this is due to the FFT size (128/64). The hihat also surprised me: I initially recorded hihat hits going from tightly closed to semi-open, but when I saw the dict for the hat, I used only the tight-to-closed hits, leaving the semi-open ones for another set.

I then trained everything with the snares on. I don’t know if it’s because my playing was different, but the kick looks more differentiated here:

Next I tried training everything using a “complete” set. As in, training the kick with snares off and on, then doing the same for the snare. I also included the complete set of hits for the hat:

Surprisingly the snare didn’t change much. And the kick is quite similar to the kick being trained with snares on only. The hat looks a bit better here, less “broadband”.

And finally I created a preseed-refined dict. I used the “complete” set from above, and then ran @filterupdate 1 with a recording of me playing a beat with the same recording setup:

This looks the best, with clear differences between the hits, which makes sense.
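As a conceptual aside, here is a hypothetical numpy sketch of what such a preseed-then-refine pass could look like, assuming @filterupdate 1 means the supplied bases are used as a starting point and then keep adapting to the performance recording. The names and the update rule are mine, not the object’s internals:

```python
import numpy as np

def refine_bases(V, W_seed, n_iter=200, eps=1e-12):
    """V: spectrogram of the performance recording (bins x frames),
    W_seed: bases trained on the isolated hits (bins x classes).
    Returns refined bases W and activations H."""
    W = W_seed.copy()
    H = np.full((W.shape[1], V.shape[1]), 0.1)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
        W *= (V @ H.T) / ((W @ H) @ H.T + eps)
        norm = W.sum(axis=0, keepdims=True) + eps
        W /= norm                      # keep each basis at unit sum...
        H *= norm.T                    # ...and push the scale into the activations
    return W, H

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    W_seed = np.abs(rng.normal(size=(65, 3)))      # stand-in for the trained dicts
    V_beat = np.abs(rng.normal(size=(65, 200)))    # stand-in for the beat's spectrogram
    W_refined, H = refine_bases(V_beat, W_seed)
    print(W_refined.shape, H.shape)
```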

In terms of how they track and sound, none of them worked super great (with the default ‘polyphonic’ patch). I didn’t spend a long time with each, as my focus/intent for today was just to create the dicts, which I could then go back and more thoroughly test and compare. That being said, I only really played back a bit of my recorded beat and messed with the thresholds at the bottom of the patch. It was difficult to find a setting that got minimal crosstalk between kick and hat while the snare still worked.

I also didn’t (thoroughly) check the jitter (monophonic) activations to see if those performed better, which was definitely the case with the other acoustic sounds I tested the system with. There’s definitely something to be said for testing/trying it with acoustic sounds, as the differentiation is much more subtle than when using the synthetic counterparts.

I’m attaching the dicts I created along with a shorter section of the “test beat” if anyone else wants to give it a spin.

trained dicts.zip (5.2 KB)

beat for testing.wav.zip (758.1 KB)

my heart gave 2 turns when I read this: I’m so proud :wink:

The way I approached that was to look at the values I get when I use the other 2 classes separately and at the same time (values in BD for a hit on SN, on HH, and on SN+HH) as my crosstalk, and set the threshold just above that. Do the same for the other 2 classes, and it was quite good in polyphonic mode. But indeed artificial sounds are more consistent… I would make the filters larger (256 if you can spare the extra 2ms) :wink:
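A quick sketch of that threshold recipe with made-up numbers, just to show the bookkeeping (the measurements and the function name are hypothetical):

```python
import numpy as np

def thresholds_from_crosstalk(crosstalk, margin=1.1):
    """crosstalk[c] = activation values of filter c recorded during hits that
    are NOT class c. Threshold = worst-case crosstalk * a small safety margin."""
    return {c: float(np.max(v)) * margin for c, v in crosstalk.items()}

if __name__ == "__main__":
    crosstalk = {                       # fake measurements for illustration
        "kick":  [0.02, 0.05, 0.04],    # kick filter during SN, HH, SN+HH hits
        "snare": [0.10, 0.03, 0.12],
        "hat":   [0.01, 0.06, 0.07],
    }
    print(thresholds_from_crosstalk(crosstalk))
    # e.g. kick gate around 0.055, snare around 0.132, hat around 0.077
```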

Hehe, learning.

I’ll make some comparisons with everything, but here is what the 256-sized dicts look like.

The “complete” set:

And the pre-seeded version (@iterations 5000 now, since I’m in no rush… it took almost 8 min):

Not massively different, particularly at the low end. There’s no hipass or anything (on the recording/analysis side).

In setting the thresh-es, I can get ok responses out of kick and snare, but it’s hard to get a good bass drum (that doesn’t also trigger the snare).

I presume you get so much snare resonance (mechanical) in the mic… have you tried it on the snare, with the special sensor, for different techniques?

I record everything on both mics. This is what the special sensor audio looks like for the snare (but normal audio for kick/hat, as the sensor doesn’t really pick those up at all):

A bit more resonance on the snare.

I actually recorded audio from a normal contact mic drum trigger on the kick drum too, so I can manually tell the difference between what’s being played, but it’d be useful to figure out how to do it purely with audio analysis.

So the idea is to have 2 processes in parallel: BD and HH as 2 classes on the air mic, and SN (many timbres) on the other input. That would be optimal, no?

Optimally I could even put a contact mic on the bottom of the hihat (something I’ve thought about anyway), and then there’s no doubt as to what’s coming from where. That’s something I’ll probably do as a broader setup, but part of my exploration here isn’t necessarily to differentiate between those sounds, as they are pretty far apart sonically (even though it’s not as clear-cut in terms of the classification).

The reason I want to do it all with an air mic is that what would be more useful is to train it on various prepared snare sounds and have it tell me what I did. That’s something the fancy sensor fails at, as it’s mechanically coupled and is trained on a finite number of unmodified sounds. So using preparations and such falls outside of its scope.

Ok, what about a hybrid: you use the sensor to detect which source to process, and then the air mic to add nuance?

That’s where the onset descriptors come in :slight_smile:

I did some recordings to test stuff further too, to see if I can get to the bottom of the loudness discrepancies, and to also test out the pitch descriptor effectiveness, though I long for some descriptors-land sigmund-ing.
