So I was looking through Alpha-02, specifically the fluid.nmfmatch~ help file, and couldn’t help but think that this would be super useful in a bunch of the things I do.
This may be a simple/stupid question (it’s been a bit since I messed with the FluCoMa files), but is it possible to train the dicts for this via “human sorting”? As in, feeding it a single sound (or representative sounds) and building a dict from that, then doing the same for each sound you want it to match.
From what I can tell, all of the examples in the help file (including the pretrained piano) require fluid.bufnmf~ to separate the source material into components for you.
I guess the analog would be: instead of having `fluid.bufnmf~` chug for hours on the piano audio file, whether it would have been possible to play in individual notes (A0, A#0, etc…) to build the same dictionary.
Read the piano training method with the rank-1 pretraining. This is what I do first, then the long synthetic nmf~ pass, but you might have luck with a small synthetic buffer of ranks like I do for my pre-training… give it a whirl and let us know - it should work to some extent.
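If it helps to see the shape of the idea outside Max, here is a rough Python sketch (numpy/scipy/sklearn standing in for fluid.bufnmf~; the file names are placeholders, not anything shipped with the package) of making one rank-1 dict per isolated sound and stacking them into a dictionary:

```python
# Sketch only: rank-1 NMF per isolated recording, then stack the templates.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft
from sklearn.decomposition import NMF

def rank1_template(path, n_fft=1024, hop=256):
    """Return a single spectral template (one 'dict') for an isolated sound."""
    sr, y = wavfile.read(path)
    if y.ndim > 1:                      # mix down to mono if needed
        y = y.mean(axis=1)
    y = y.astype(float)
    _, _, Z = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    X = np.abs(Z)                       # magnitude spectrogram: bins x frames
    W = NMF(n_components=1, init='nndsvda', max_iter=500).fit_transform(X)
    return W[:, 0] / (W[:, 0].max() + 1e-12)

# one recording per sound you want to recognise (placeholder file names)
notes = ['A0.wav', 'As0.wav', 'B0.wav']
dicts = np.stack([rank1_template(p) for p in notes], axis=1)   # bins x n_sounds
```

Each column there plays the role of one channel of the pno-dicts buffer.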
The source piano audio file isn’t included, but I take it that you are literally telling it to analyze the 3 seconds which correspond to each piano note?
I see, it took me a while to figure out that it’s writing each dict as an individual buffer channel. Looks like double-clicking on buffer~ pno-dicts piano-dicts.wav shows all 89 dicts, but buffer~ pno-dicts only shows one, which confused the shit out of me.
As a mini feature-request, a non-polyphonic version would be handy in the help file as piano/sustained sounds come with additional problems and complications that aren’t applicable in other cases.
You are right about the piano training. I’ve also pushed it to the shared audio files on the forum if you want to hear it.
Have you tried the bird finder patcher with drums instead? A combination of both should help you voice your desires - and then I am happy to help too, but I don’t understand your last question (non-polyphonic)… do you mean knowing which of the N ranks it is most likely to be?
I mean more in terms of how fluid.nmfmatch~ responds when being fed sustained sounds (since no single component contains the “chords” being played, it jumps around wildly).
I’m mainly, and obviously, thinking that this would be great for training specific drum sounds and/or their nearest equivalents, and with that also thinking of latency, so being able to tell what was nearest as a single shot would be ideal (without having to go into the scheduler while running a separate onset detection algorithm).
I’ll go back and create some recordings of training-suitable drum hits and give it a go. (I did some simultaneous recordings using DPAs and the Sensory Percussion trigger last year, but they were performative/mixed hits)
Ok, so I finally got around to making some recordings and then a patch to analyze with, and I’ve run into some problems I don’t understand.
The idea of the patch is to allow for training 10 different sounds, with the intention of then being able to match against them. Unlike the piano example, I’ve built the patch to take individual recordings (ranging from 16s to 56s) to create all the individual dictionaries.
This part appears to be working fine (I think). Where it goes bad is where I try to create the preseed dictionary.
Why are there 89 ranks/buffers for 88 notes?
Why is it analyzing the whole (combined) audio file looking for the ranks?
What should the output of it look like? (without your source audio file it’s not possible to run the training patch)
Here are some additional questions on the patch that don’t deal with things being “broken”.
Why is nmfmatch using different fft settings? (better latency?)
What would be good fft settings for non-pitched drum sounds? (I would prefer better latency over low-frequency resolution)
Is it possible to remove outliers in the dictionary-creation stage? (or is it necessary to be very careful with the training data via curation)
How can the matching aspect be optimized for real-time use? (not relying on an external onset detector?, fft settings, iterations and other ML settings which produce better matching results vs initial analysis time)
Responding quickly after a long day, without looking at your patch yet
The idea (that failed in that example) was to create a non-note dict, for all the noises and stuff… it didn’t work so well, so I declared failure.
the idea was to create a single rank per note, then combine them, then re-run the process on the whole thing to allow the algo to tweak itself (segregate more between the ranks) - If your patch works well, you could run the nmfmatch on the seed dicts combined and that should work too…
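The same ‘seed then refine’ step, again only as a Python sketch: `dicts` is the stacked rank-1 templates from the sketch above, and `X_all` is assumed to be the magnitude spectrogram of all the training audio concatenated, analysed with the same fft settings as the templates.

```python
# Sketch of seeding NMF with the assembled rank-1 dicts and letting it refine them.
import numpy as np
from sklearn.decomposition import NMF

n_ranks = dicts.shape[1]
W0 = dicts.astype(np.float64) + 1e-6            # seed bases (keep strictly positive)
H0 = np.full((n_ranks, X_all.shape[1]), 1e-3)   # flat initial activations

model = NMF(n_components=n_ranks, init='custom', solver='mu',
            beta_loss='kullback-leibler', max_iter=300)
W_refined = model.fit_transform(X_all, W=W0, H=H0)   # bases nudged apart by the joint pass
H_refined = model.components_                        # activations over the whole training set
```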
the output to what? nmfmatch will give you activations (i.e. a ‘volume’ for each dictionary as a list)
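A rough Python analogue of that per-frame output (not the actual update inside fluid.nmfmatch~, just the same idea of fitting fixed dicts to the incoming spectrum):

```python
# Sketch: how much of each fixed dictionary is present in one magnitude spectrum.
import numpy as np
from scipy.optimize import nnls

def frame_activations(frame, dicts):
    """frame: magnitude spectrum (bins,); dicts: bins x n_ranks -> one value per rank."""
    acts, _ = nnls(dicts, frame)   # non-negative fit against the bases
    return acts                    # the 'volume' of each dictionary in this frame
```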
it is not supposed to, but more overlap means more time precision?
Try as low as 256 fft and 64 hop and see. It all depends on how different your source sounds are, I reckon, but I have not tried. It is like any fft process: you will lose bass resolution, but if you are all in the top-mid then you should be fine.
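For what it’s worth, rough numbers for that 256/64 suggestion, assuming 44.1 kHz:

```python
# back-of-envelope figures for fft 256 / hop 64 at 44.1 kHz
sr, n_fft, hop = 44100, 256, 64
print(n_fft / sr * 1000)   # ~5.8 ms of signal per analysis window
print(hop / sr * 1000)     # ~1.45 ms between analysis frames
print(sr / n_fft)          # ~172 Hz bin spacing, so not much detail below that
```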
What I did was make a rank-1 dict note by note, and assemble them into a huge multi-rank dict. I don’t understand outliers in that context. Did you check the ‘didactic’ example to get a better understanding of what it does? In the bufnmf you can see different trainings and their influence…
That you have to try. It is the whole point of having a creative coder on the project: try stuff and see how you get on
Ah cool, wasn’t expecting a reply so quickly! Some quick responses before going to bed as well.
Ah right, so I can remove the +1-ness of it.
I see. So it just updates the pno-dicts in place? And the audio and acts dicts are just for regular resynthesis stuff?
Sorry, I meant the fluid.bufnmf~ step where it reprocesses all the dicts. When I ran it I got 11 buffers filled with ones, but it could be because I ran it with a completely empty buffer/dict.
So the fft settings should be consistent between the preprocessing steps and real-time matching?
I’ll give that a spin and see. At the moment the sounds I’m using are all very similar. As in, I’m training sounds like “center of snare”, “edge of snare”, “rimshot”, etc… There are some differences, but they are all acoustic/open snare sounds. If I get this going I’ll then try it on some ‘full kit’ training data and see how it responds.
My training data is very different from yours in that I recorded something like 40-50 hits on each section of the drum, at varying volumes, and then analyzed that. So it’s possible that there will be some examples in there that aren’t ideal (like trying to hit 40 rimshots in a row and missing one or two). I can go in and edit those out, but it was more a question of whether, when training it, it would be possible to give it a percentile argument or something.
I get that, it’s just that some things are completely opaque in terms of the algorithm and help file (e.g. iterations). I can do a bunch of reading up on nmf, but I would be starting from scratch vs you guys who know a bit more of what’s happening.
Ok played with this some more today and got it “working”. And by working I mean that I got some reasonable data out of each of the steps, and got the real-time matching to spit stuff out.
I also made a ‘batch’ version of the patch that allows pointing it to a folder of arbitrary wave files and it will do its thing on them. (patch below)
With this part here, I’m not sure I follow the steps. So I processed each file to create individual dicts, then fluid.bufcompose~’d them into a single 10-channel file. THEN I ran fluid.bufnmf~ on the audio from all of them combined. This produces the audio/envelope buffers just fine, but the dicts appear to be unchanged by this process.
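My guess, sketched in plain numpy (this is generic multiplicative-update NMF, nothing to do with FluCoMa’s internals), is that if the seeded bases end up being treated as fixed rather than updated, they would come back unchanged by construction:

```python
# Generic NMF with multiplicative updates, showing fixed vs. updated bases.
import numpy as np

def nmf_mu(X, W, H, iters=100, update_bases=True):
    """X: bins x frames, W: bases (dicts), H: activations; Frobenius-norm updates."""
    W, H = W.astype(float).copy(), H.astype(float).copy()
    eps = 1e-12
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)        # activations always refined
        if update_bases:
            W *= (X @ H.T) / (W @ H @ H.T + eps)    # dicts only change if allowed to
    return W, H
```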
The final, and more critical, part is that I can’t see a good way to get the matching algorithm to spit out the closest match in a timely manner. Looking at the matching data, some notes/attacks are pretty clear, but others not so much. And if I copy the exact algorithm from your piano patch, waiting 100ms is pretty useless for onset-based processing.
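Roughly what I’m imagining, as a Python sketch (the `frames` input is hypothetical: a handful of activation lists grabbed right after an onset, rather than waiting 100ms):

```python
# Sketch: collapse a few post-onset activation frames into a single decision.
import numpy as np

def classify_onset(frames, labels, min_strength=0.05):
    A = np.asarray(frames)              # n_frames x n_ranks
    profile = A.mean(axis=0)            # average activation per dictionary
    if profile.max() < min_strength:    # nothing matched convincingly
        return None
    return labels[int(profile.argmax())]

# e.g. classify_onset(first_three_frames, ['centre', 'edge', 'rimshot'])
```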
As we (@tremblap and @a.harker) have spoken about a couple of times, it would be great to reverse-engineer the Sensory Percussion ML/onset algorithm, to be able to train it on arbitrary hits and get it to work without their software (which spits out only MIDI).
ok now I get what you are trying to do, and I don’t think you will find, in the first toolbox, which is one of decomposition, the machine learning tool for dynamic fast browsing and mapping… As I said in the other thread, we will consider this (black box, copyrighted) object you are trying to reproduce, but I would have a few creative questions for you:
what are you trying to achieve in reverse-engineering a black box that seems to do everything you want it to do? What does it not do that you want to do?
what would you need us to explain better/differently to streamline your learning process? @weefuzzy will soon plough on with a refactor of the KE website, but it would be good for us to know where you stall. I am sure he (and I and @groma) will study your great forum entries, but if you could voice your pending questions that would be ace.
for your commission, are you not trying to plough a field with a Ferrari? Or to race with a tractor? Anyway, the mandate is to use the tools, with their different features, to stretch your world… but you seem to start from a task which is not completely adapted to the provided tools… 100 ms of delay is only a problem for RT percussion. You can do soooooo many other things with this set of tools! Allow a bit of divergence and see where the latency leads you?
Thanks again for the candid feedback on your usage progress (and the bug reports - you are catching cool exceptions).
In a general way, it’s very limited in scope. As in, their specific hardware driving a sampler. So being able to do much more than that would be great. Things like (fast) “onset descriptors”, and more broadly useful training and matching algorithms, with some openness and variability (as in being able to feed it any (real-time) audio and have it return what it thinks it was, along with the relevant distances).
Additionally, because their whole thing is a blackbox, it means it’s not possible to use it in another context (easily), without having shit/MIDI resolution (and presumably additional latency).
Yeah, would be happy to talk about that, and the pedagogy of the toolbox in general. The helpfiles, not surprisingly(!), have been a bit difficult to understand and go through, and with many of the examples I have only understood what’s happening after speaking with someone. The examples are also geared around a specific working process.
My initial idea was not this at all, and actually I still don’t know what I plan on doing with it. From what I was initially thinking (and mentioned at the plenary), the toolset isn’t nearly as far along as what I had in mind (I had some database stuff in mind).
Most of my ideas are centered around real-time use as that’s a big part of my interest and what I do. The pure (and offline) decomposition stuff is interesting, but doesn’t especially appeal to me as I don’t really “compose” in that manner. Even when I do compose, I don’t work with “sonic materials” that way.
So at the moment, this is just me exploring the tools and seeing what’s possible, with a big slant towards real-time processes. This quite likely won’t be possible, and therefore won’t be part of what I do, but may trigger something else (100ms later anyways…).
(tl;dr: Part of the issue here might just be that the STFT is a hard thing to get working with percussive sounds. But all is not lost.)
So, you’re trying to make a classifier. That is, indeed, a kind of machine-learning task, but for these purposes NMF is a kind of machine learning. Whether, in combination with the STFT, it’s the best thing to discriminate drum hits is another matter.
The way the workflow above works is to use the first passes of bufnmf @ rank 1 to create sort of spectral summaries of the things you want to match on. As such, any of the temporal dynamics are sort of washed away, and you’re at the mercy of how well a species of drum strike can be represented by a summary of its magnitude spectrum across time. My hunch is that the answer might be ‘not very well’.
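A toy illustration of that point (synthetic signals only, nothing to do with your actual drum material): a decaying noise burst and its time-reversed version sound completely different, but their rank-1 summaries are essentially the same template.

```python
# Toy demo: rank-1 spectral summaries throw away temporal dynamics.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
sr = 44100
burst = rng.standard_normal(sr // 2) * np.exp(-np.linspace(0, 8, sr // 2))  # sharp attack, long decay
swell = burst[::-1]                                                         # slow swell, abrupt end

def template(y):
    _, _, Z = stft(y, fs=sr, nperseg=512, noverlap=384)
    W = NMF(n_components=1, init='nndsvda', max_iter=500).fit_transform(np.abs(Z))
    w = W[:, 0]
    return w / np.linalg.norm(w)

print(template(burst) @ template(swell))   # close to 1.0: the summary can't tell them apart
```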
So, things we can ponder:
– would a different underlying representation of the signal work better?
– would you get better differentiation from NMF + STFT if you generate your dictionaries just from the first few milliseconds of a strike? (see the sketch below)
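Here is the attack-only idea as a Python sketch (numpy/scipy/sklearn again standing in for the Max objects; the 20 ms window and `onset_sample` are arbitrary choices to illustrate, not recommendations):

```python
# Sketch: build a rank-1 dict from only the first ~20 ms after an onset.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft
from sklearn.decomposition import NMF

def attack_template(path, onset_sample, attack_ms=20, n_fft=256, hop=64):
    sr, y = wavfile.read(path)
    if y.ndim > 1:
        y = y.mean(axis=1)
    y = y.astype(float)
    seg = y[onset_sample: onset_sample + int(sr * attack_ms / 1000)]
    _, _, Z = stft(seg, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    W = NMF(n_components=1, init='nndsvda', max_iter=500).fit_transform(np.abs(Z))
    return W[:, 0]
```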
(As an aside, in Chris Kiefer’s echo state network code, which I made a slightly crappy Max version of a while back, he demonstrates making a drum classifier by training it on raw audio. I suspect this approach wouldn’t scale hugely well if you wanted a biggish vocabulary, but it’s interesting nonetheless. Looks like the only way to play with it is to build from Chris’s repo @ https://github.com/chriskiefer/Fecho , but if you fancy, I’d be happy to compile the PD object for you to experiment with)
That’s good to know. I was hoping to get some good mileage out of that approach, and it was useful to patch around it, but it may not be the direction for this.
Curious what this means? Like some sort of pre-cooking, or something else?
Interesting indeed…
I might try and chop up some audio and train it on that, but is there an upper(/lower) limit and/or sweet spot on how many examples to feed the algorithm? In @tremblap’s example he uses 3 notes per dictionary I think, whereas I went with 30+.
This would be handy, if you wouldn’t mind. It may be time (finally, and after much procrastination) to watch the Kadenze videos on ML, as I’m not really up on the different algorithms, approaches, and their respective merits and strengths/weaknesses.